I have no doubt that a lot of garden-variety diagnoses and treatments can be done by an AI system that is fine-tuned and vetted to accomplish the task. I recently had to pay $93 for a virtual session with a physician to get a prescription for cough syrup, and I already knew what to take before talking to her because I had done some research/reading. Some may argue, "Doctors studied years in med school and you shouldn't trust Google more than them", but knowing human fallibility, and knowing that a lot of doctors do look things up on places like https://www.wolterskluwer.com/en/solutions/uptodate to refresh/reaffirm their knowledge, I'd argue that if we are willing to take the risk, why shouldn't we be allowed to take that risk on our own? Why do I have to pay $93 (on top of the cough syrup that cost ~$44) just so that the doctor can see me on Zoom for less than 5 mins and submit an order for the med?
With healthcare prices increasing at breakneck speed, I am sure AI will take on more and more of the role of diagnosing and treating people's common illnesses, and hopefully (though I doubt it) some of those savings will be passed on to the patients.
P.S. In contrast to the US system, in my home city (Rangoon, Burma/Myanmar), I have multiple clinics near my home and a couple of pharmacies within two bus stops' distance. I can either buy most of the medications I need from the pharmacy (without a prescription) and take them on my own (why am I not allowed to take that risk?) OR go see a doctor at one of these clinics to confirm my diagnosis, pay him/her $10-$20 for the visit, and then head down to the pharmacy to buy the medication. Of course, some medications, such as those that include opioids, will only be sold to me with a doctor's prescription, but a good number of other meds are available as long as I can afford them.
The only reason this worked is because your situation was exceedingly simple.
The trouble is you are not educated enough to tell what is simple and what isn't. A cough could be just a cough, or it could be something more serious; only a "real" examination will reveal that. And sometimes even that's not enough, and you need an examination by a specialist.
I'll tell you a story. Once upon a time I got pain in my balls. I went to a doctor and he felt around and said he didn't feel anything. I went to another doctor and he felt something, but he had no idea what it was. He said it could be a cyst, could be a swollen vein, could be an infection - he didn't even know if it was on the testicle or on the tube thingy.
Then I went to a Urologist. You can tell this man has felt up a lot of balls. He felt me up and said, "yup, that's a tumor" almost immediately. He was right, of course, and he ended up being the one to remove it too. Since I caught the cancer pretty early the chemotherapy wasn't too intense.
Point is, expertise matters when things aren't straightforward. That's when experience and perspective get to shine.
As a competing anecdote, last summer my toddler woke up from a nap with bruises on his legs and swelling around his joints. We initially thought maybe he had gotten tangled up in his crib, but later, when changing his diaper, we found his testicles were swollen and rushed him to the emergency room. Over the next 6-7 days we saw no fewer than 5 doctors, including his pediatrician, orthopedic specialists, and doctors at the ER and urgent care, none of whom were able to give us any answers about this weird bruising and swelling that randomly appeared on his legs. After all of this, during one of his flare-ups, I took a picture of his legs and searched it with Google Lens. The results included pictures and symptoms of Henoch-Schonlein purpura [0] that were identical to what he was experiencing. We confirmed this with his pediatrician and decided on a treatment plan, but I was floored at how many doctors we had to see (and how much money we had to spend), only for me to diagnose this on my own with Google Lens.
[0] https://www.mayoclinic.org/diseases-conditions/henoch-schonl...
I don't think this is the slam-dunk you think it is. You had to go to three doctors before someone recognized something was wrong. ChatGPT is the substitute for the first two doctors, not the third.
And even there, I bet ChatGPT would have told you to go see a doctor, since it can't feel your balls. And after your first appointment, if you had told it that you still thought something was wrong, it would probably have told you to go see a urologist.
>And even there, I bet ChatGPT would have told you to go see a doctor
So what is the value add of ChatGPT in this scenario? Either it says you don't need to see the doctor, which could very well kill you, or it tells you to visit the doctor, in which case it doesn't actually help you in any way.
You're missing the case where it tells you not to see a doctor and it's right, in which case you save a lot of time (and potentially money).
Mistakes will happen; that's the nature of anything statistical. But don't forget that saving 1,000s of people time is worth something, even if you never directly see it the way you do the one case that it didn't catch.
> But don't forget that saving 1,000s of people time is worth something
How many unnecessary deaths does that buy you?
The value-add would be getting my insurance to approve me to go see a "specialist" (the urologist) immediately rather than forcing me to schedule an appointment with my primary-care doctor first, paying $200 for the office visit, and delaying care by 2 weeks.
So, a technical solution to a bureaucratic problem.
Also, even when I'm on a PPO plan instead of HMO...I know when I have a yeast infection. I don't want to pay $100 to tell a doctor that I have a yeast infection. They don't even look at my genitals - they just take my word for it. But I can't buy fluconazole without a prescription. Such a waste of everyone's time and my money.
So again, a technical solution to a bureaucratic problem.
That isn't value created by the AI; it is value bestowed on it by your insurance company. They could bestow that same value on a Google search or a call center worker in Southeast Asia and it would provide you the same value.
The AI is just making an already convoluted system more convoluted.
Get a PPO plan, then? I can go directly to any specialist I want, although I find it’s better to always talk to my PCP first, who then makes a referral and often recommends the specialist she prefers.
(If your HMO charges $200 for a PCP visit - wow, I’d look for a different plan.)
My second example was specifically called out as an example of an issue even with PPOs. Also, not everyone (including myself) can easily change jobs or get insurance that simply isn't even offered by the employer. Worth noting that in my state, for example, none of the public ACA health insurance options are a PPO.
> I bet ChatGPT would have told you to go see a doctor, since it can't feel your balls.
Not _yet_!
I fall slightly on the “doomer” side of the safety spectrum, but if I’m being honest I’ll be first in line - rain, sleet, or snow - to buy the chatgpt that can!
Right, so as you've pointed out, chatgpt wouldn't have helped me in any way.
And, for the record, I was referred to a Urologist. The doctor was able to say "something isn't right but I don't know what" and he referred me to a Urologist and got my ultrasounds.
No offense to uh... AI... but it's not very good at being humble. It seems hubris wasn't in the training data. It's quite difficult to get an LLM to say "I don't know". Usually you have to prime it or ask leading questions, and at that point you already know that it doesn't know, so trying to elicit that response is worthless.
> ChatGPT is the substitute for the first two doctors, not the third.
The third doctor used to behave more like the first two. In a world where ChatGPT replaced the rookies, how do we get good doctors?
> Why do I have to pay $93 (on top of the cough syrup that cost ~$44) just so that the doctor can see me on Zoom for less than 5 mins and submit an order for the med?
Because you're paying for the expertise of someone who studied for more than a decade which you won't get from a random web search.
An AI system with today's technology should be less trustworthy for medical diagnosis than a web search. At least with a web search you might stumble upon a site with content from experts, assuming you trust yourself to be able to discern expert advice from bot-generated and spam content. Even if a doctor is doing the searching instead of me, I would pay them only for their knowledge to make that discernment for me. Why you think an AI could do better than a human at that is beyond me.
Your question reminds me of that famous Henry Ford GE invoice story:
> Making chalk mark on generator: $1.
> Knowing where to make mark: $9,999.
I dislike that quote. It's not about knowing where to make the mark. It's that there's no one else who knows where to make the mark charging less than $9,999. Knowing where to make the mark implies that payment is proportional to ability/productivity, but it is proportional chiefly to relative positions of power. Learning where to make the mark won't earn you $9,999; it will instead put you in competition with the other $9,999 markers in a downward race to the "true value" of those marks.
> Why you think an AI could do better than a human at that is beyond me.
You strike me as someone who thinks like me before I started using LLMs. Now when I read things like this, rather than thinking in terms of black and white like I might have before, I instead think in terms of percent likelihoods, and once you let in the grey it's no longer a matter of can/cannot but of how many resources will be invested into making it so.
I think you hold doctors in pretty high esteem, and I'm not sure how many other people hold them in as high esteem as you do. I've certainly ended up worse off for having visited one. How seriously do you think most doctors take continuing education, particularly if they are overworked? An AI can consume the latest research/guidelines/textbooks immediately. Doctors can often act on 20-year-old medical dogma. An AI can get immediate social media feedback weighted in a way professionals deem appropriate.
I think diagnostic work is one of the things most likely to be taken over by AI. Even when asking specific technical questions in my own field, AI is capable of hitting some key points, and with extra prodding even getting to some very very specific niche points without too much coddling.
> You strike me as someone who thinks like me before I started using LLMs.
Cool, except I've been using LLMs since their inception, and they're now part of my daily workflow, for better or worse. My opinion is based purely on my experience as a user over the last few years.
Besides, my point in that paragraph was specifically about the ability to judge accuracy in information and to produce accurate information, not whether AI can outperform humans in other tasks. Modern AI systems fail miserably at this, which might be acceptable for casual use, but when human lives depend on that accuracy, such as in the healthcare industry, the process of integrating them should be gradual, with a lot of oversight and regulation. We've done that in the automotive industry, for obvious reasons, and the same should be done in others.
> I think you hold doctors in pretty high esteem and I'm not sure how many other people hold doctors in as high esteem as you do.
We hold the medical profession, not individual doctors, in high esteem for good reasons. It is critical to society, and requires intense education for many years. At the end of the day, doctors are as fallible as any human, but they're required to be licensed, and we have regulations in place for medical malpractice. The idea that a computer system based on probabilities and randomness, that hallucinates most of the time, will be more reliable than the average human doctor is ludicrous, regardless of what some benchmark says.
I'm not against using AI in healthcare. Pattern matching to detect early stages of cancer is a very good use case that's been helping radiologists for many years now. But that's the key word: _helping_. My concern is that when these systems are used for medical diagnosis and advice, they will instead be relied on much more than they should be.
You answered why someone would want to pay $93, but not why they have to pay $93.
I mean, it really depends on what you're trying to diagnose/do. Yeah, a math PhD might have spent a decade+ studying the subject. Doesn't mean I need to pay for their skills instead of using a calculator/program to solve my math problem.
> Why you think an AI could do better than a human at that is beyond me.
Why do you think an AI couldn't do better than a human, when we have ample evidence of computers/AI exceeding humans in many areas?
> Why do you think an AI couldn't do better than a human, when we have ample evidence of computers/AI exceeding humans in many areas?
I was specifically referring to the ability to discern between accurate content and nonsense. SOTA LLMs today produce nonsensical output themselves, partly due to their training data coming from poor-quality sources. Cleaning up and validating training data for accuracy is an unsolved and perhaps unsolvable problem. We can't expect AI to do this for us, since it requires judgment from expert humans. And for specific applications such as healthcare, accuracy is not something you can ignore by placing a disclaimer.
Many human 'experts' produce nonsensical data. Verification of data by humans is also mostly based on 'prior' data. We've had many popular medical practices over the years developed and supported by medical experts which turned out to be completely wrong.
The main thing missing right now, imo, is the ability for LLMs to verify data via experimentation, but this is completely solvable.
There's a limit though, right? How about this: if you can accurately predict the doctor's diagnosis, it's free; if not, you pay. This person needed a doctor's approval, presumably for a prescription they couldn't get without authorization, which leads to this gatekeeping. Not to mention it also contributes to the insane medical costs in the US. $93 for 5 minutes is over $1k _an hour_ for what amounts to a rubber stamp (because how much can you really dig into a person's medical history and condition in 5 mins?).
> An AI system with today's technology should be less trustworthy for medical diagnosis than a web search.
This is the problem with reasoning from first principles. This statement is easily proven false by giving it a try, whether it "should" be true or not.
Where are you that you need a prescription to get cough medicine? The only ones I know of that require prescription are the ones with controlled substances.
Benzonatate is a commonly prescribed somewhat stronger cough medicine which requires a prescription. I've known a number of people who've had to take it for a persistent or stubborn cough.
Wikipedia says "Overdose of only a small number of capsules can be fatal", but the study they cite shows that you'd need to take 10x-20x the usual dose to reach the smallest amount that's been shown to kill anyone -- which is admittedly a narrower safety window than over-the-counter Tylenol/Acetaminophen/Paracetamol, but still, few people take 10-20 pills in one day of something that says "take 1 per day".
NPs and community paramedics are definitely the future for a huge chunk of the current GP workload, but PAs are a failed experiment which are being rolled back pretty much everywhere.
Are you sure about that? The ranks of PAs have been expanding rapidly. I know there has been some political resistance over expanding their scope of practice but I'm not aware of it being significantly rolled back.
The US is the global outlier here and it's in the face of the evidence. The number of PAs globally is plummeting as they're abolished by national authorities.
Nah. Total PA numbers are increasing in many other developed countries as well, including Canada and much of Europe. We can have a discussion about whether this is a good idea, but first you need to stop lying and making things up.
PAs have to practice under some level of physician oversight pretty much everywhere. That part was never in dispute. But they can offload physicians for some routine primary care cases where a real doctor isn't needed.
The data clearly shows that PA numbers continue increasing in many countries, so obviously they don't consider it a failed experiment and you're just lying to push some kind of personal agenda. It's clear you have no experience in this area, so I wonder why the need to comment at all?
We are the global outlier because it costs an arm and a leg to go to college and medical school here. In the rest of the world the education is free or still affordable. And that's why all the young doctors are also imports.
Yes, and being able to buy and take whatever you want has led to rampant antibiotic abuse.
Unfortunately medicine at the high level always has to cater to the lowest common denominator.
Personally, while I agree AI can probably take up some of the mundane triage, I have some reservations about AI in its current LLM form being a good judge, as patients are unreliable narrators and doctors rely on many senses and heuristics to make their recommendations.
An AI fine-tuned to be a doctor, with perfect health data and patient history, is an interesting prospect, especially given the loss of the "family doctor", but that kind of data collection is still a ways away and has other Pandora's box considerations.
I mean, if that is your typical experience with healthcare, what you need is a big shake-up; don't count on AI to help you there! (But it might inflate the profit margins for the few running the show, at your expense.)
IMHO it is much easier to increase the supply of AI doctor hours than the supply of human doctor hours, and supply constraints are the main thing maintaining doctor hours at the high price it currently sells for.
Isn't there an obvious conflict of interest when the model maker is also the creator of a benchmark? I think at the very least the benchmark should come from a separate business entity under the nonprofit, or from the nonprofit holding entity itself.
I don't think it's necessarily bad to have the benchmark, but the graphs of Gemini and Claude doing worse than o3 did kind of leave a bad taste in my mouth. "Oh look, your models are worse than ours at this very important metric that we just made up! How terrible!"
It's also barely better, within the error bar on some metrics. So if this is the best they can do on their own benchmark that's pretty pathetic and shows they have no edge anymore.
Good lord. The idea that a system that produces pseudo-random output without any semantic understanding can be relied on to give accurate health-related information is deeply flawed and troubling. It's one thing to use these things for finding patterns in data, for entertainment purposes, and for producing nonsensical code a human has to fix, but entirely different to rely on them for health diagnosis or advice. I shudder at the thought that a medical practitioner I go to will parrot whatever an LLM told them.
Can you state more clearly why it's deeply flawed?
Because while LLMs obviously have massive limitations, so do humans, and it's not entirely clear to me that some synthesis of the two can't produce much better results than either on its own.
> ... some synthesis of the two can't produce much better results than either on its own.
In theory, I agree with you. The word "some" is doing a lot of heavy lifting there though. I only hope that whatever definition of "some" emerges, it's not a horribly flawed one.
That's my hope as well. A responsible doctor that uses technology critically and knows its limitations is fine with me. My concern is that it's more likely that whatever AI says will be taken for granted, overruling any human judgment, because they're supposed to be expert systems that scored highly on some made-up benchmark produced by the same company that trained the model.
We see this today with AI-generated content on the web, and a flood of sloppily put together software produced by people who swear that AI is making them more productive. There's little interest in judging the output, and a lot of interest in lazy cash grabs. There are no guardrails in place in the healthcare industry AFAIA to prevent the same happening there, which is a scary thought.
You would not hire or go to a mentally unstable doctor. If a doc hallucinates or makes a mistake, it will be rare. They can self-introspect and work on it if they realise their mistake. AI is NOT the same. It has no self-reflection, and hallucination isn't rare. AI is not mentally stable in any sense.
Whether you (or anyone else) actually understands something is opaque to me. For all I know, your comment was left by an AI. There is no magic wand we can wave to separate out the people who actually understand something (whatever that means), from the people who just appear to.
The best we can do is test a human's mastery of a subject to estimate how well they actually know and understand that topic. Which is exactly what OpenAI is doing here.
What I care about is the results. If the "grade" is 10%, then I don't want to rely on it, whether it's a human or an AI. If it's 95%, then I feel fine about relying on it. Especially since I suspect that very soon, most doctors would not score as well on a benchmark like this as the SOTA models.
> What I care about is the results. If the "grade" is 10%, then I don't want to rely on it, whether it's a human or an AI.
So you judge mastery of a subject by some arbitrary grade? Produced, incidentally, by the same company that trained the model. But I'm sure they wouldn't cheat...
I also care about the results, but I judge it based on direct experience. IME the current SOTA models can't be relied on to accurately produce working code most of the time, yet I'm expected to trust they can produce accurate healthcare diagnosis and advice? Give me a break.
AI companies have sold us a fantasy that a neat computer trick, which can still be very useful for society, will somehow lead to AGI. Now that we're a decade into this hype cycle and investors are getting antsy, they're doubling down by anthropomorphizing the technology and selling us "chain of thought" and "reasoning", as if making everyone say these things will somehow magically produce intelligence.
What's really irresponsible is that the public and governments are buying into it, which is pushing the tech into industries that can actually harm people. Safety is not just about trusting some arbitrary grade.
>Now that we're a decade into this hype cycle and investors are getting antsy, they're doubling down by anthropomorphizing the technology and selling us "chain of thought" and "reasoning", as if making everyone say these things will somehow magically produce intelligence.
The Transformer, nevermind GPT-3 did not exist a decade ago. I guess LLMs aren't the only things that hallucinate and spout confident nonsense.
>I also care about the results, but I judge it based on direct experience. IME the current SOTA models can't be relied on to accurately produce working code most of the time, yet I'm expected to trust they can produce accurate healthcare diagnosis and advice? Give me a break.
You don't see the fallacy of forcing your experience in one domain onto a completely unrelated one, regardless of any evidence to the contrary (even if you don't trust OpenAI, this is hardly the only paper trialing SOTA LLMs for diagnosis)? What does code have to do with diagnosis? And while the current SOTA is by no means perfect, if you can't get them to produce working code, that's a you problem. At the very least, many users would disagree.
Yeah benchmarks aren't perfect. Doesn't mean they aren't useful. Certainly a lot more useful than your approach.
> The Transformer, nevermind GPT-3 did not exist a decade ago.
I'm attributing the start of the current AI hype cycle to the resurgence of CNNs using GPUs, roughly around the time of AlexNet and AlphaGo, not to LLMs and the Transformer architecture. Though if we're being really pedantic, the original Transformer paper is from 2017, so almost a decade ago. But this is beside my point.
> You don't see the fallacy of forcing your experience in one domain onto a completely unrelated one
The machine has no concept of a "domain". Whether it's outputting code, poetry, images, or video, it's all data generated by probabilistic pattern matching and pseudo-randomness. The structure and accuracy of the generated data is meaningful only to humans, and it's the most important factor that is measured in all these benchmarks.
We might find it collectively amusing when an AI produces funny text and weird looking pictures and video. Some might find it acceptable when it produces buggy code that humans need to fix, or when it mimics an expert by confidently spouting nonsense, which is where I personally draw the line. But we should all be concerned when the same models are used in industries where human lives depend on critical thinking by experts.
We have been promised fully autonomous vehicles for more than a decade now, and only in the last couple of years have some parts of that promise begun to come true in very limited scenarios. We're all understandably reluctant to give control of a speeding 2-ton object to a machine, for obvious reasons. The process has been very gradual, with a lot of oversight and regulation, as there should be. All I'm saying is that there should be an equal amount of oversight in other industries as well, particularly healthcare. Arbitrary benchmarks don't make me trust these systems more, regardless of who produces them.
>I'm attributing the start of the current AI hype cycle to the resurgence of CNNs using GPUs, roughly around the time of AlexNet and AlphaGo, not to LLMs and the Transformer architecture.
The current hype cycle, the one fueling hundreds of billions in investment by multiple of the biggest tech companies in the world has little to do with Alex Net and AlphaGo and everything to do with LLMs and generative AI in general.
>Though if we're being really pedantic, the original Transformer paper is from 2017, so almost a decade ago. But this is beside my point.
The transformer paper did not start the generative AI craze. GPT-3 in 2020 did it for the research world, and the release of ChatGPT in Nov 22 did it for the general public.
>The machine has no concept of a "domain". Whether it's outputting code, poetry, images, or video, it's all data generated by probabilistic pattern matching and pseudo-randomness. The structure and accuracy of the generated data is meaningful only to humans, and it's the most important factor that is measured in all these benchmarks.
What are you on about? Of course it does. These conversations are getting tiring. Yes, LLMs model concepts directly, independent of the text they are trained on. This has been demonstrated multiple times, including very recently again by Anthropic. There is nothing random about the predictions they make.
And even if the machine didn't model these things directly, the concept of domains would still be relevant to the humans testing it as 'data' is not equal.
SOTA LLMs are good for diagnosis. That was evident even before this benchmark. I'm not talking about some imagined future. I'm talking about right now. Sticking your head in the sand because of your 'experience' with coding is nonsensical. The benchmarks aren't arbitrary - In some cases they are directly testing the ability in question.
I'm not advocating for removing doctors from the picture entirely. That wouldn't even be possible if I were, at least at the moment.
Good lord. The idea that a sleep‑deprived, bias‑prone mammal can be relied on to give accurate health‑related information is deeply flawed and troubling. It’s one thing to seek a human for empathy, small talk, or a routine vaccine, but entirely different to assume an overworked doctor will consistently out‑diagnose a modern data tool. I shudder at the thought that the practitioner I visit might simply repeat whatever a drug‑rep lunch pitch or outdated guideline left in their memory.
Are there regulations regarding doctors' working hours similar to, say, truck drivers? I found this [1] article which states doctors are limited to 80 hours a week, whereas truck drivers are limited to 70 hours in a 7 day period. [2]
If the available evidence does not show an improvement in patient outcomes with shorter hours, and if longer hours are worked willingly and enable them to treat more patients (iso widespread physician shortages), yes I do.
Turning the question around, if mandating shorter hours for physicians were found to result in higher rates of significant patient harm, would you be in favor of those restrictions?
I like that they include the "worst case score at k samples". This is a much more realistic view of what will happen, because someone will get that 1/100 response.
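For anyone curious what that kind of metric looks like in practice, here is a minimal sketch, assuming per-prompt graded scores are already available; the function name, data layout, and the Monte Carlo sampling-with-replacement approach are my own illustration, not OpenAI's exact estimator.

```python
import random

def worst_at_k(scores_per_prompt, k, trials=2000):
    """Estimate the expected score if, for every prompt, an unlucky user
    got the worst of k independently sampled responses.

    scores_per_prompt: list of lists; each inner list holds graded scores
    (e.g. 0.0-1.0) for repeated samples of the same prompt.
    """
    total = 0.0
    for _ in range(trials):
        # For each prompt, draw k samples (with replacement) and keep the worst.
        minima = [min(random.choices(samples, k=k)) for samples in scores_per_prompt]
        total += sum(minima) / len(minima)
    return total / trials

# A prompt that usually scores well but occasionally fails badly drags
# worst-at-k down much faster than it drags the plain mean down.
scores = [[0.9, 0.85, 0.92, 0.1], [0.7, 0.75, 0.8, 0.78]]
print(worst_at_k(scores, k=1))  # roughly the ordinary average score
print(worst_at_k(scores, k=4))  # much lower: the 0.1 response dominates
```

The gap between k=1 and larger k is exactly the "someone will get that 1/100 response" effect the parent comment is pointing at.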
You can't download Gemini's weights either, so it's not relevant as a comparison against Gemini.
I think the actually-relevant issue here is that until last month there wasn't API access for Grok 3, so no one could test or benchmark it, and you couldn't integrate it into tools that you might want to use it with. They only allowed Grok 2 in their API, and Grok 2 was a pretty bad model.
Yes, OpenAI has a first-mover advantage, and Claude seems close behind as a second player with its closed models too. Open weights are not a requirement for success, but in an already crowded market (Grok's prospect), their proposition competes neither with the top-tier closed models nor with the maybe less capable but more available, battle-tested open ones that are freely available to run locally.
Happy to see this. I've struggled with an injury for the past five years. I've been to multiple sports-focused physicians, had various scans. Responses from doctors have ranged from "everything seems fine, can't really figure this out" to [completely wrong hypothesis]. Tried acupuncture. Tried a chiropractor. I remember one doctor, though, had an interesting thought that seemed to make sense - but I've been so discouraged from so many false starts or misplaced hope, I didn't bother following up.
Finally I typed in my entire history into o3-deep-research and let it rip for a while. It came back with a theory for the injury that matched that one doctor, diagrams of muscle groups and even illustrations of proposed exercises. I'm not out of the woods yet, but I am cautiously optimistic for the first time in a long time.
Why would you trust an LLM over a battery of human experts? I find it hard to believe that the doctors never proposed exercises or some kind of physical therapy for you, at least in the US.
I've never used LLMs for this, but as someone who's been through a lot of sports-related injuries, I find doctors more or less useless (except for prescribing painkillers and performing surgeries.)
No doctor or physio has ever been able to fix my chronic issues, and I've always had to figure them out myself through lots of self-study and experimentation.
I can't speak to the OP's condition, but having seen plenty of doctors and physical therapists in the US for over a decade:
Yes, they propose exercises.
No, they don't work.
For certain (common) conditions, PT seems to have it nailed - the exercises really help. For the others, it's just snake oil. Not backed by much research. The current state of the art is just not good when it comes to chronic pain.
So while I don't know if an LLM can be better than a battery of human experts, I do know that those human experts do not perform well. I'm guessing with the OP's case, that battery of human experts does not lead to a consensus - you just end up with 10 different treatments/diagnoses (and occasionally, one is a lot more common than the other, but it's still wrong).
Much of modern medicine's expertise is focused on areas that reimburse well (in some part because those industries subsidize the research).
I would not expect most physicians to have a deep fund of literature-backed knowledge to draw from regarding exercise. Telling someone to do an exercise probably doesn't compensate well.
That said, I'm also pretty negative about the availability of rigorous literature regarding much of nutrition, dentistry, podiatry, physical therapy, etc... you know, the things that affect the health of most human beings that have ever lived.
Sports medicine specialists do this all day every day. They don't hesitate to prescribe specific exercises if they think it will help, and as for compensation they get paid the same for that service as any other regular office visit (typically CPT code 99212 or similar).
Because there is so much variability in individual injuries and physiology it's extremely difficult to do rigorous studies comparing different treatments. Like even something common like a rotator cuff tear isn't one single thing that can always be treated the same way. Patients and practitioners will often have to follow a trial-and-error process until they figure out what works in a particular case. Experienced providers who see a lot of cases eventually develop a lot of tacit knowledge about this in a way that's difficult to codify or explain.
Do you therefore agree that most physicians do not have a "deep fund of literature-backed knowledge to draw from regarding exercise"?
> Because there is so much variability in individual injuries and physiology it's extremely difficult to do rigorous studies comparing different treatments.
This is generally true of research in all areas of medicine; I don't see it as a fundamental reason why we wouldn't hope for equally powered and rigorous studies focusing on the issues that affect most of the population, even if there are not pharmaceutical or device companies that will profit from the recommended interventions.
I'm only a decade in, but my clinical experiences in my area of expertise vary widely from those of my colleagues, which vary from those of other colleagues, in spite of years of shared experience in a fairly homogenous demographic. Expert opinion is not a level of evidence that carries much weight in my mind.
It may be the best we have in many cases, I just wish we had something better.
You have to be realistic and assume that treating most major musculoskeletal injuries will be a lengthy trial-and-error process. Every injury is unique and imaging studies or other tests can only give limited clues about what's going on. There are patients in terrible pain whose scans look normal, and there are other patients whose scans indicate serious pathology yet they are pain-free and completely functional. Instead of getting discouraged, keep trying different modalities (except for the pseudo-science BS) until something works.
In general a lot of those injuries will eventually heal on their own. So it's easy to fool yourself into believing that a particular treatment was effective even when the real cure was time.
Find a good DPT with a team of PTs and PTAs, and possibly also see a DO who can do osteopathic manipulative treatment (OMT).
Start with a dedicated orthopaedic practice, which will make referrals to the above, then order CT scans/MRIs if the above isn't helping, and evaluate whether surgery is necessary.
One challenge is that it is hard to even know how to navigate all this. LLMs could actually help with that.
I hope recent cuts to government science have managed to hit enough of the safetyists and industry captures who keep us from just trying out new healthcare approaches like this and learning. They'd like nothing better than to replace the help you got with "As a large language model, I am unable to offer medical advice."
This appears to be a very thoughtful and helpful study. It's also impressive to see the improvement in performance in just the last year of model development - almost double.
I've found o3 & deep research to be very effective in guiding my health plan. One interesting anecdote - I got hit in the chest (right over the heart) quite hard a month or so ago. I prompted o3 with my ensuing symptoms and heart rate / oxygenation data from my Apple watch, and it already knew my health history from previous conversations. It gave very good advice and properly diagnosed me with a costochondral sprain. It gave me a timeline to expect (which ended up being 100% accurate) and treatments / ointments to help.
IMO - it's a good idea to have a detailed prompt ready to go with your health history, height/weight, medications and supplements, etc., so if anything happens to you, you've got it handy to give to o3 to help with a diagnosis.
SpO2 data from an Apple Watch (or any wrist device) is highly suspect and not reliable as a diagnostic tool. Under ideal conditions it can work reasonably well but it can also be thrown way off by multiple factors like wristband tightness, skin color, patient movements, dirt, etc. A cheap and simple dedicated pulse oximeter that clips on to your finger will generally be much more accurate and consistent.
How confident was it in that diagnosis? And if so, did you trust its confidence? I asked my medico wife about this, and apparently the other possibility is pericardial effusion from injury, which can be life-threatening.
Definitely - o3 proposed a few things initially: Costochondral separation, Intercostal strain, rib fracture, pericardial effusion and a few other things...
After answering questions about symptoms, it gave me a few movements/actions to do and to rate the pain of each. Based on that it diagnosed me and gave me a timeline of what to expect and things to watch out for.
Really it was as helpful as a doctor could have been in that scenario, IMO.
My sense is that these benchmarks are not realistic in terms of the way the model is used. People building specialized AI systems are not, in my experience, letting users just chat with a base model; they would have some variant of RAG plus some guardrails plus other stuff (like routing to pre-written answers for common questions).
So what use case does this test setup reflect? Is there a relevant commercial use case here?
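For concreteness, a toy sketch of the kind of wrapper the parent comment describes. Every name here (the canned-answer table, the guardrail rule, the retrieval and model stubs) is hypothetical; the point is only to show where routing, guardrails, and RAG sit in front of the base model.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    route: str  # "canned", "refusal", or "rag"

# Hypothetical pre-written answers for very common questions.
CANNED = {
    "what is a normal resting heart rate?":
        "Roughly 60-100 bpm for adults; check with a clinician about your own case.",
}

def violates_guardrail(question: str) -> bool:
    # Toy rule: refuse anything that asks the product to prescribe.
    return "prescribe" in question.lower()

def retrieve(question: str) -> list:
    # Stand-in for a vector-store lookup over a curated medical corpus.
    return ["<retrieved guideline snippets>"]

def call_model(question: str, context: list) -> str:
    # Stand-in for the actual LLM call, prompted with the retrieved context.
    return "<answer grounded in the retrieved snippets>"

def answer(question: str) -> Answer:
    q = question.strip().lower()
    if q in CANNED:                      # route common questions to vetted text
        return Answer(CANNED[q], "canned")
    if violates_guardrail(question):     # guardrail: refuse out-of-scope requests
        return Answer("I can't help with that; please talk to a clinician.", "refusal")
    return Answer(call_model(question, retrieve(question)), "rag")

print(answer("What is a normal resting heart rate?"))
print(answer("Please prescribe me antibiotics"))
```

Benchmarking only the bare model skips all three of those layers, which is the mismatch being questioned here.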
For general medical Q&A I can't see how a specialized system would be better than base o3 with web search and a good prompt. If anything RAG and guardrail prompts would degrade performance.
Non-clinicians are using ChatGPT every day now to try to find assistance (right or wrong) to real-life medical problems. This is a great evaluation set that could prevent a lot of harm
Recently I uploaded a lab report to chatGPT and asked it to summarize it.
It hallucinated serious cancer, along with all the associated details you’d normally find on a lab report. It had an answer to every question I had pre-asked about the report.
Sigh. This is a point in favor of not allowing free access to ChatGPT at all, given that people are getting mad at GPT-4o-mini, which is complete garbage for anything remotely complex... and garbage for most other things, too.
Just give 5 free queries of 4o/o3 or whatever and call it good.
Or a non-normie. Even while logged in, I had no idea what ChatGPT model it was using, since it doesn't label it. All the label says is "great for everyday tasks".
And as a non-normie, I obviously didn't take its analysis seriously, and compared it to Grok and Gemini 2.5. The latter was the best.
I wonder if it was unable to read your report, and just answered as if role-playing?
I gave it a pdf of an engine manual recently and asked some questions, which it answered reasonably. It even pulled a schematic out for me, though it was the wrong one (it gave me a schematic for the CDI ignition variant that we first talked about, rather than the DSAI one we settled on later.)
Other Key Metrics
Absolute Neutrophils, Lymphocytes, Monocytes, Eosinophils, Basophils: All within normal ranges, suggesting no significant abnormalities in immune cell populations.
Anisocytosis (1+): Indicates variation in red blood cell size.
Hypochromasia (1+): Suggests pale red blood cells, supporting iron-deficiency anemia.
Summary
These results suggest iron-deficiency anemia, characterized by low hemoglobin, low MCV, low MCH, and high RDW. The elevated red blood cell count may reflect compensatory mechanisms or mild dehydration.
Your provider recommends a heart-healthy, low-fat diet, which is great for overall health. However, addressing the anemia might also involve increasing dietary iron or taking iron supplements.
The diagnosis is wrong, btw, I don't have iron deficiency. The anemia is caused by a genetic condition called thalassemia, which has been verified by genetic tests. You can use the Mentzer Index to differentiate the two on a simple CBC - https://www.mdcalc.com/calc/10534/mentzer-index
My numbers return a "probable diagnosis."
I was wondering if ChatGPT would catch it; nope, it didn't. It did say that it was a possibility once I suggested it, though.
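For reference, the Mentzer Index in that link is just MCV divided by the RBC count. A quick sketch of the arithmetic follows; the CBC values below are made up, and this is an illustration, not medical advice.

```python
def mentzer_index(mcv_fl: float, rbc_millions_per_ul: float) -> float:
    """Mentzer Index = MCV (fL) / RBC count (millions per microliter)."""
    return mcv_fl / rbc_millions_per_ul

def interpret(index: float) -> str:
    # Conventional cutoff: below 13 points toward thalassemia trait,
    # above 13 points toward iron-deficiency anemia.
    if index < 13:
        return "thalassemia trait more likely"
    if index > 13:
        return "iron-deficiency anemia more likely"
    return "indeterminate"

# Made-up example: low MCV with a relatively high RBC count.
idx = mentzer_index(mcv_fl=62.0, rbc_millions_per_ul=5.8)
print(f"Mentzer index {idx:.1f}: {interpret(idx)}")  # ~10.7 -> thalassemia trait more likely
```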
It's also disappointing that the LLM would falsely claim that a low-fat diet is "great for overall health". That claim has been thoroughly debunked in the scientific literature (or at least not supported by evidence-based medicine), and yet the recommendation still keeps popping up. While there are certain specific types of fats that can be problematic, in general a moderate amount of fat is fine as a macronutrient.
LLMs don't have a magical ability to arbitrate complex truths. If the "low-fat diet" narrative is prominent in the training literature, I would expect the model to say that more often than the opposite.
I don't ever really use the term "thoroughly debunked" when referring to nutrition science; as you noted, a better term is that the claim is not supported by the evidence. I've seen enough things debunked and then rebunked to know that nutrition science is not really that accurate.
What was the outcome (did the subject of the lab report go on to present cancer)?
(sometimes what we think is a false positive ends up being a true positive later, and cancer is a late-arriving signal).
I dunno, cancer is complicated. I'd have to see the details to make that judgement. It really depends on a lot of factors- what the nature of the sample was, what the lab report did, etc.
I walked away from this with a feeling I can't quite put into words.
I'm not a doctor, and I don’t really understand medical AI, but a friend of mine who is a doctor has been relying more and more on ChatGPT lately—to look up guidelines, organize his thoughts. He says it’s not that it knows more than he does, but that it’s fast, clear, and saves him time.
That got me thinking. I used to assume AI in healthcare was about replacing people. Now, it feels more like an extension. Like doctors are getting an extra pair of hands—or a second consultation room that's always online.
Maybe that’s what progress looks like: human judgment is still there, but increasingly shaped by structured input. I don’t know if that’s good or bad. It just feels complicated.
The problem is the healthcare industry will leverage this to place more responsibilities in the hands of increasingly less qualified individuals - as they're already doing with nurses. Costs will never go down, but the salary of healthcare professionals and quality of care will.
AI is a tool to transfer wealth from the middle class to the ultra wealthy. If there is a brief intermediary period where quality of life improves for the poor, it's a temporary coincidence that hasn't been optimized away yet.
Exactly correct. So eventually we’ll pay a nurse’s aide to type prompts into GPT the same price for a visit where we used to see an MD (and then an NP, and then a PA, and then a BSN, and then an RN…), and get charged the same.
That’s a valid concern, especially in a system so focused on cutting costs.
What felt a bit different here is that the doctors helped define the evaluation standards.
If AI gets built with them instead of for them, maybe there’s still some hope it won’t just be about saving money.
the amount of obvious astroturfing by people with investments in LLMs on this website is insane.
you only need to look at this thread alone to prove your point. this is a race to the bottom that we're being accelerated towards by some of the worst people on earth.
If your condition can easily be resolved by waiting a little and letting your body recover, an honest doctor will tell you so. I wonder if an AI will ever risk not recommending you to see a doctor.
No, they will not, actually. At least in the lower-middle, Medicaid spectrum of care.
Physicians at this level serve the labor market, and everything they do is ordered to keeping you able-bodied and working and on the job and not taking sick days.
They have always been reluctant to dispense excuses from work, to document any sort of disability related trouble, or to give advice that does not involve more exercise or more medication.
I have honestly approached multiple doctors in multiple health systems to complain about fatigue, syncope, and exhaustion, and they all go to extraordinary lengths to deny that anything is wrong or that anything should change.
I use Medicaid and have never found that to be the case.
In fact, I’d prefer to be back on full work duties, but I’m still on light duty until my next appointment, assuming he restores me to full duty.
Providers who are reimbursed via Medicaid have zero incentive specific to Medicaid to not give out work “excuses”, document disabilities and so on. Generally speaking the same providers serve both Medicaid and non-Medicaid patients.
Perhaps there are some states where the Medicaid providers are really awful (California?).
Which raises an important question: to what extent is American healthcare built around liability?
There are a lot of innocuous things built into the healthcare system, like gating the cough syrup that can easily get you really high behind a visit to a provider, that many people are just ignorant of by virtue of the current setup.
When we disrupt fields that we are not experts in we often miss oceans of nuance.
I’ve used GPT for personal health research—nothing serious, just to help clarify questions before a doctor visit.
It’s useful, but honestly? Also unsettling. The authority it projects is easy to believe, even when it’s just confidently wrong.
HealthBench seems like a small step toward grounding that power in something real. It doesn't fix the problem, but at least it admits it exists.
I would really like to see a benchmark purely focused on diagnosis: symptoms and patient history vs. the real diagnosis. Maybe name this model House M.D. 1.0 or something.
The other stuff is good to have, but ultimately a model that focuses on diagnosing medical conditions is going to be the most useful. Look - we aren't going to replace doctors anytime soon, but it is good to have a second opinion from an LLM purely for diagnosis. I would hope it captures patterns that weren't observed before. This is exactly the sort of game that AI can beat a human at - large-scale pattern recognition.
i have zero trust in openai's ability to do anything impartially. why should we leave the judgement of a private tool up to the makers of the tool especially when human lives are at stake?
> OpenAI CFO Sarah Friar told the Financial Times that OpenAI is weighing the inclusion of ads in its products but wants to be "thoughtful about when and where we implement them."
I agree - we should exercise a bit of caution here. There is no way they would release a benchmark which makes their model look bad. But then again, we know that their models are among the best for other uses, so it's not a big leap to accept this benchmark.
I have no doubt that a lot of garden-variety diagnoses and treatments can be done by an AI system that is fine-tuned and vetted to accomplish the task. I recently had to pay $93 to have a virtual session with a physician to get prescription for a cough syrup, which I already knew what to take before talking to her because I did some research/reading. Some may argue, "Doctors studied years in med school and you shouldn't trust Google more than them", but knowing human's fallibility and knowing that a lot of doctors do look things up on places like https://www.wolterskluwer.com/en/solutions/uptodate to refresh/reaffirm their knowledge, I'd argue that if we are willing to take the risk, why shouldn't we be allowed to take that risk on our own? Why do I have to pay $93 (on top of the cough syrup that costed ~$44) just so that the doctor can see me on Zoom for less than 5 mins and submit an order for the med?
With the healthcare prices increasing at the breakneck speed, I am sure AI will take more and more role in diagnosing and treating people's common illnesses, and hopefully (doubt it), the some of that savings will be transferred to the patients.
P.S. In contrast to the US system, in my home city (Rangoon, Burma/Myanmar), I have multiple clinics near my home and a couple of pharmacy within two bus stops distance. I can either go buy most of the medications I need from the pharmacy (without prescription) and take them on my own (why am I not allowed to take that risk?) OR I can go see a doctor at one of these clinics to confirm my diagnosis, pay him/her $10-$20 for the visit, and then head down to the pharmacy to buy the medication. Of course, some of the medications that include opioids will only be sold to me with the doctor's prescription, but a good number of other meds are available as long as I can afford them.
The only reason this worked is because your situation was exceedingly simple.
The trouble is you are not educated enough to tell what is simple and what isn't. A cough could be a cough or it could be something more serious, only a "real" examination will reveal that. And sometimes even that's not enough, you need an examination by a specialists.
I'll tell you a story. Once upon a time I got pain in my balls. I went to a doctor and he felt around and he said he didn't feel anything. I went to another doctor and he felt something, but he had no idea what it was. He said could be a cyst, could be a swollen vein, could be an infection - he didn't even know if it was on the testicle or on the tube thingy.
Then I went to a Urologist. You can tell this man has felt up a lot of balls. He felt me up and said, "yup, that's a tumor" almost immediately. He was right, of course, and he ended up being the one to remove it too. Since I caught the cancer pretty early the chemotherapy wasn't too intense.
Point is, expertise matters when things aren't straight forward. Then, experience and perspective gets to shine.
As a competing anecdote, last summer my toddler woke up from a nap with bruises on his legs and swelling around his joints. We initially thought maybe he was tangled up in his crib, but later when changing his diaper found his testicles were swollen and rushed him to the emergency room. Over the next 6-7 days we saw no less that 5 doctors including his pediatrician, orthopedic specialists, and doctors at the ER and urgent care. None of whom were able to give us any answers about this weird bruising and swelling that randomly appeared on his legs. After all of this, during one of his flair ups, I took a picture of his legs and searched it with Google lens. The results set included pictures and symptoms of Henoch-Schonlein purpura [0], that were identical to what he was experiencing. We confirmed this with his pediatrician and decided on a treatment plan, but I was floored at how many doctors we had to see (and how much money we had to spend), only for me to diagnose this on my own with Google lens.
[0] https://www.mayoclinic.org/diseases-conditions/henoch-schonl...
I don't think this is the slam-dunk you think it is. You had to go to three doctors before someone recognized something was wrong. ChatGPT is the substitute for the first two doctors, not the third.
And even there, I bet ChatGPT would have told you to go see a doctor, since it can't feel your balls. And after your first appointment, if you had told it that you still thought something was wrong, it would probably have told you to go see a urologist.
>And even there, I bet ChatGPT would have told you to go see a doctor
So what is the value add of ChatGPT in this scenario? Either it says you don't need to see the doctor which could very well kill you or it tells you to visit the doctor and it doesn't actually help you in any way.
You're missing the case of it telling you not to see a doctor and it's right, in which you save a lot of time (and potentially money).
Mistakes will happen, that's the nature of statistics of this nature. But don't forget that saving 1,000s of people time is worth something, even if you never directly see it like you do the one case that it didn't catch.
> But don't forget that saving 1,000s of people time is worth something
How many unnecessary deaths does that buy you?
The value-add would be getting my insurance to approve me to go see a "specialist" (the urologist) immediately rather than forcing me to schedule an appointment with my primary-care doctor first, paying $200 for the office visit, and delaying care by 2 weeks.
So, a technical solution to a bureaucratic problem.
Also, even when I'm on a PPO plan instead of HMO...I know when I have a yeast infection. I don't want to pay $100 to tell a doctor that I have a yeast infection. They don't even look at my genitals - they just take my word for it. But I can't buy fluconazole without a prescription. Such a waste of everyone's time and my money.
So again, a technical solution to a bureaucratic problem.
That isn’t value created by the AI, it is valued bestowed on it by your insurance company. They could bestow that same value on a Google search or a call center worker in Southeast Asia and it would provide you the same value.
The AI is just making an already convoluted system more convoluted.
Get a PPO plan, then? I can go directly to any specialist I want, although I find it’s better to always talk to my PCP first, who then makes a referral and often recommends the specialist she prefers.
(If your HMO charges $200 for a PCP visit - wow, I’d look for a different plan.)
My second example was specifically called out as an example of an issue even with PPO's. Also, not everyone (including myself) can easily change jobs or get insurance that simply isn't even offered by the employer. Worth noting that in my state for example, none of the public ACA health insurance options are a PPO.
> I bet ChatGPT would have told you to go see a doctor, since it can't feel your balls.
Not _yet_!
I fall slightly on the “doomer” side of the safety spectrum, but if I’m being honest I’ll be first in line - rain, sleet, or snow - to buy the chatgpt that can!
Right, so as you've pointed out, chatgpt wouldn't have helped me in any way.
And, for the record, I was referred to a Urologist. The doctor was able to say "something isn't right but I don't know what" and he referred me to a Urologist and got my ultrasounds.
No offense to uh... AI... but it's not very good at being humble. It seems hubris wasn't in the training data. It's quite difficult to get an LLM to say "I don't know". Usually you have to prime it or ask leading questions, which at that point you already know that it doesn't know so therefore trying to elicit that response is worthless.
> ChatGPT is the substitute for the first two doctors, not the third.
The third doctor used to behave more like the first two. In a world where ChatGPT replaced the rookies, how do we get good doctors?
> Why do I have to pay $93 (on top of the cough syrup that costed ~$44) just so that the doctor can see me on Zoom for less than 5 mins and submit an order for the med?
Because you're paying for the expertise of someone who studied for more than a decade which you won't get from a random web search.
An AI system with today's technology should be less trustworthy for medical diagnosis than a web search. At least with a web search you might stumble upon a site with content from experts, assuming you trust yourself to be able to discern expert advice from bot-generated and spam content. Even if a doctor is doing the searching instead of me, I would pay them only for their knowledge to make that discernment for me. Why you think an AI could do better than a human at that is beyond me.
Your question reminds me of that famous Henry Ford GE invoice story:
> Making chalk mark on generator: $1.
> Knowing where to make mark: $9,999.
I dislike that quote. It's not knowing where to make the mark. It's that there's no one else who knows where to make the mark charging less that $9,999. Knowing where to make the mark implies that payment is proportional to ability/productivity, but it is proportional chiefly to relative positions of power. Learning where to make the mark won't earn you $9,999, it will instead put you in competition with the other $9,999 markers in a downward race to the "true value" of those marks.
> Why you think an AI could do better than a human at that is beyond me.
You strike me as someone who thinks like me before I started using LLMs. Now when I read things like this rather than thinking in terms of black and white like I might have before, I instead think about percent likelihoods, and once you let in the grey it's no longer a matter of can/cannot but how many resources will be invested into making it so.
I think you hold doctors in pretty high esteem, and I'm not sure how many other people hold them in as high esteem as you do. I've certainly ended up worse for having visited one. How seriously do you think most doctors take continuing education, particularly if they are overworked? An AI can consume the latest research/guidelines/textbooks immediately. Doctors often act on 20-year-old medical dogma. AI can get immediate social media feedback, weighted in whatever way professionals deem appropriate.
I think diagnostic work is one of the things most likely to be taken over by AI. Even when asking specific technical questions in my own field, AI is capable of hitting some key points, and with extra prodding even getting to some very very specific niche points without too much coddling.
> You strike me as someone who thinks like me before I started using LLMs.
Cool, except I've been using LLMs since their inception, and they're now part of my daily workflow, for better or worse. My opinion is based purely on my experience as a user over the last few years.
Besides, my point in that paragraph was specifically about the ability to judge accuracy in information and to produce accurate information, not whether AI can outperform humans in other tasks. Modern AI systems fail miserably at this, which might be acceptable for casual use, but when human lives depend on that accuracy, such as in the healthcare industry, the process of integrating them should be gradual, with a lot of oversight and regulation. We've done that in the automotive industry, for obvious reasons, and the same should be done in others.
> I think you hold doctors in pretty high esteem and I'm not sure how many other people hold doctors in as high esteem as you do.
We hold the medical profession, not individual doctors, in high esteem for good reasons. It is critical to society, and requires intense education for many years. At the end of the day, doctors are as fallible as any human, but they're required to be licensed, and we have regulations in place for medical malpractice. The idea that a computer system based on probabilities and randomness, that hallucinates most of the time, will be more reliable than the average human doctor is ludicrous, regardless of what some benchmark says.
I'm not against using AI in healthcare. Pattern matching to detect early stages of cancer is a very good use case that's been helping radiologists for many years now. But that's the key word: _helping_. My concern is that when these systems are used for medical diagnosis and advice, they will instead be relied on much more than they should be.
You answered why someone would want to pay $93, but not why they have to pay $93.
I mean, it really depends on what you're trying to diagnose/do. Yeah, a math PhD might have spent a decade+ studying the subject. Doesn't mean I need to pay for their skills instead of using a calculator/program to solve my math problem.
> Why you think an AI could do better than a human at that is beyond me.
Why do you think an AI couldn't do better than a human, when we have ample evidence of computers/AI exceeding humans in many areas?
> Why do you think an AI couldn't do better than a human, when we have ample evidence of computers/AI exceeding humans in many areas?
I was specifically referring to the ability of discerning between accurate content and nonsense. SOTA LLMs today produce nonsensical output themselves, partly due to their training data being from poor quality sources. Cleaning up and validating training data for accuracy is an unsolved and perhaps unsolvable problem. We can't expect AI to do this for us, since this requires judgment from expert humans. And for specific applications such as healthcare, accuracy is not something you can ignore by placing a disclaimer.
Many human 'experts' produce nonsensical data. Verification of data by humans is also mostly based on 'prior' data. We've had many popular medical practices over the years developed and supported by medical experts which turned out to be completely wrong.
The main thing missing right now, imo, is the ability for LLMs to verify data via experimentation, but this is completely solvable.
There's a limit though, right? How about this: if you can accurately predict the doctor's diagnosis, it's free; if not, you pay. This person needed a doctor's approval, presumably for a prescription they couldn't get without authorization, which leads to this gatekeeping. Not to mention it also contributes to the insane medical costs in the US. $93 for 5 minutes is over $1k _an hour_ for what amounts to a rubber stamp (because how much can you really dig into a person's medical history and condition in 5 minutes?).
AI systems have been improving. o3 now has the capability to decide to search multiple times as part of its response.
Even o4-mini (free) uses web searches and runs Python scripts very eagerly. I'm not sure how long they'll be able to afford giving all of that away.
>Why you think an AI could do better than a human at that is beyond me.
AI can take your entire medical history and the entire history of physics, chemistry, biology, and medicine up to May 2025...
A human was sleep deprived in Med school 15 years ago.
> An AI system with today's technology should be less trustworthy for medical diagnosis than a web search.
This is the problem with reasoning from first principles. This statement is easily proven false by giving it a try, whether it "should" be true or not.
Where are you that you need a prescription to get cough medicine? The only ones I know of that require prescription are the ones with controlled substances.
Benzonatate is a commonly prescribed somewhat stronger cough medicine which requires a prescription. I've known a number of people who've had to take it for a persistent or stubborn cough.
Wikipedia says "Overdose of only a small number of capsules can be fatal", but the study they quote shows you'd need to take 10x-20x the usual dose to reach the smallest amount that's been shown to kill anyone -- which is admittedly a narrower safety window than over-the-counter Tylenol/Acetaminophen/Paracetamol, but still, few people take 10-20 pills in one day of something that says "take 1 per day".
Ain't nothing wrong with a little Wockhardt to get you through the day
> Of course, some of the medications that include opioids will only be sold to me with the doctor's prescription
Yes, and the commenter was not complaining about such cases!
There are simply not enough doctors for doctors to be doing this work.
We have a massive, massive shortage of doctors.
The industry is doing everything they can to make it worse by the day, so I won't hold my breath that we'll get the slightest bit of respite.
It'd obviously be ideal if everyone could see a doctor for an affordable price any time they wanted.
We don't live in the ideal world.
This would be a HUGE win for most people.
This is why most routine primary care can be delivered by Nurse Practitioners and Physician Assistants working under a doctor's supervision.
NPs and community paramedics are definitely the future for a huge chunk of the current GP workload, but PAs are a failed experiment that is being rolled back pretty much everywhere.
Are you sure about that? The ranks of PAs have been expanding rapidly. I know there has been some political resistance over expanding their scope of practice but I'm not aware of it being significantly rolled back.
https://www.bls.gov/ooh/healthcare/physician-assistants.htm
The US is the global outlier here and it's in the face of the evidence. The number of PAs globally is plummeting as they're abolished by national authorities.
Nah. Total PA numbers are increasing in many other developed countries as well, including Canada and much of Europe. We can have a discussion about whether this is a good idea, but first you need to stop lying and making things up.
https://capa-acam.ca/pa-profession/pa-facts
https://www.srh-university.de/de/folder/news/2025/04-25/erst...
https://www.bigregister.nl/over-het-big-register/cijfers/ver...
The only country in Europe where PAs made any significant inroads was the UK, where their most significant role has been abolished.
In Germany and all other countries, PAs are unable to treat patients without direct oversight and they are single digit percentages compared to NPs.
It's clear you have no experience in this area, so I wonder why the need to comment at all?
PAs have to practice under some level of physician oversight pretty much everywhere. That part was never in dispute. But they can offload physicians for some routine primary care cases where a real doctor isn't needed.
The data clearly shows that PA numbers continue increasing in many countries, so obviously they don't consider it a failed experiment and you're just lying to push some kind of personal agenda. It's clear you have no experience in this area, so I wonder why the need to comment at all?
We are the global outlier because it costs an arm and a leg to go to college and medical school here. In the rest of the world the education is free or still affordable. And that's why all the young doctors are also imports.
No, it's because the role played by PAs is performed by the much more capable and clinically proven NPs and community paramedics.
Yes, and being able to buy and take whatever you want has led to rampant antibiotic abuse.
Unfortunately medicine at the high level always has to cater to the lowest common denominator.
Personally, while I agree AI can probably take up some of the mundane triage, I have some reservations about AI in its current LLM form being a good judge, as patients are unreliable narrators and doctors rely on many senses and heuristics to make their recommendations.
An AI fine-tuned to be a doctor, with perfect health data and patient history, is an interesting prospect, especially given the loss of the "family doctor", but that kind of data collection is a ways off and has other Pandora's-box considerations.
I mean, if such is your typical experience with healthcare, what you need is a big shake up, don't count on AI to help you there! (but it might inflate the profit margins for the few running the show at your expense).
You think you won’t have to pay the AI? lol.
IMHO it is much easier to increase the supply of AI doctor hours than the supply of human doctor hours, and supply constraints are the main thing maintaining doctor hours at the high price it currently sells for.
My $700 laptop could already run an 8B model back in 2023.
Isn't there an obvious conflict of interest when the model maker is also the creator of a benchmark? I think at the very least it should come from a separate business entity under the nonprofit, or from the nonprofit holding entity itself.
I don't think it's necessarily bad to have the benchmark, but the graphs of Gemini and Claude doing worse than o3 did kind of leave a bad taste in my mouth. "Oh look, your models are worse than ours at this very important metric that we just made up! How terrible!"
It's also barely better, within the error bar on some metrics. So if this is the best they can do on their own benchmark that's pretty pathetic and shows they have no edge anymore.
Good lord. The idea that a system that produces pseudo-random output without any semantic understanding can be relied on to give accurate health-related information is deeply flawed and troubling. It's one thing to use these things for finding patterns in data, for entertainment purposes, and for producing nonsensical code a human has to fix, but entirely different to rely on them for health diagnosis or advice. I shudder at the thought that a medical practitioner I go to will parrot whatever an LLM told them.
This insanity needs to be regulated yesterday.
Can you state more clearly why it's deeply flawed?
Because while LLMs obviously have massive limitations, so do humans, and it's not entirely clear to me that some synthesis of the two can't produce much better results than either on its own.
> ... some synthesis of the two can't produce much better results than either on its own.
In theory, I agree with you. The word "some" is doing a lot of heavy lifting there, though. I only hope that whatever definition of "some" emerges, it's not a horribly flawed one.
That's my hope as well. A responsible doctor that uses technology critically and knows its limitations is fine with me. My concern is that it's more likely that whatever AI says will be taken for granted, overruling any human judgment, because they're supposed to be expert systems that scored highly on some made-up benchmark produced by the same company that trained the model.
We see this today with AI-generated content on the web, and a flood of sloppily put together software produced by people who swear that AI is making them more productive. There's little interest in judging the output, and a lot of interest in lazy cash grabs. There are no guardrails in place in the healthcare industry AFAIA to prevent the same happening there, which is a scary thought.
You would not hire or go to a mentally unstable doctor. If a doc hallucinates or makes a mistake, it will be rare. They can even introspect and work on it if they realize their mistake. AI is NOT the same. It has no self-reflection, and hallucination isn't rare. AI is not mentally stable in any sense.
Whether you (or anyone else) actually understands something is opaque to me. For all I know, your comment was left by an AI. There is no magic wand we can wave to separate out the people who actually understand something (whatever that means), from the people who just appear to.
The best we can do is test a human's mastery of a subject to estimate how well they actually know and understand that topic. Which is exactly what OpenAI is doing here.
What I care about is the results. If the "grade" is 10%, then I don't want to rely on it, whether it's a human or an AI. If it's 95%, then I feel fine about relying on it. Especially since I suspect that very soon, most doctors would not score as well on a benchmark like this as the SOTA models.
> What I care about is the results. If the "grade" is 10%, then I don't want to rely on it, whether it's a human or an AI.
So you judge mastery of a subject by some arbitrary grade? Produced, incidentally, by the same company that trained the model. But I'm sure they wouldn't cheat...
I also care about the results, but I judge it based on direct experience. IME the current SOTA models can't be relied on to accurately produce working code most of the time, yet I'm expected to trust they can produce accurate healthcare diagnosis and advice? Give me a break.
AI companies have sold us a fantasy that a neat computer trick, which can still be very useful for society, will somehow lead to AGI. Now that we're a decade into this hype cycle and investors are getting antsy, they're doubling down by anthropomorphizing the technology and selling us "chain of thought" and "reasoning", as if making everyone say these things will somehow magically produce intelligence.
What's really irresponsible is that the public and governments are buying into it, which is pushing the tech into industries that can actually harm people. Safety is not just about trusting some arbitrary grade.
>Now that we're a decade into this hype cycle and investors are getting antsy, they're doubling down by anthropomorphizing the technology and selling us "chain of thought" and "reasoning", as if making everyone say these things will somehow magically produce intelligence.
The Transformer, never mind GPT-3, did not exist a decade ago. I guess LLMs aren't the only things that hallucinate and spout confident nonsense.
>I also care about the results, but I judge it based on direct experience. IME the current SOTA models can't be relied on to accurately produce working code most of the time, yet I'm expected to trust they can produce accurate healthcare diagnosis and advice? Give me a break.
You don't see the fallacy of forcing your experience in one domain onto a completely unrelated one, regardless of any evidence to the contrary (even if you don't trust OpenAI, this is hardly the only paper trialing SOTA LLMs for diagnosis)? What does code have to do with diagnosis? And while the current SOTA is by no means perfect, if you can't get them to produce working code, that's a you problem. At the very least, many users would disagree.
Yeah benchmarks aren't perfect. Doesn't mean they aren't useful. Certainly a lot more useful than your approach.
> The Transformer, nevermind GPT-3 did not exist a decade ago.
I'm attributing the start of the current AI hype cycle to the resurgence of CNNs using GPUs, roughly around the time of AlexNet and AlphaGo, not to LLMs and the Transformer architecture. Though if we're being really pedantic, the original Transformer paper is from 2017, so almost a decade ago. But this is beside my point.
> You don't see the fallacy of forcing your experience in one domain onto a completely unrelated one
The machine has no concept of a "domain". Whether it's outputting code, poetry, images, or video, it's all data generated by probabilistic pattern matching and pseudo-randomness. The structure and accuracy of the generated data is meaningful only to humans, and it's the most important factor that is measured in all these benchmarks.
We might find it collectively amusing when an AI produces funny text and weird looking pictures and video. Some might find it acceptable when it produces buggy code that humans need to fix, or when it mimics an expert by confidently spouting nonsense, which is where I personally draw the line. But we should all be concerned when the same models are used in industries where human lives depend on critical thinking by experts.
We have been promised fully autonomous vehicles for more than a decade now, and only in the last couple of years have some parts of that promise begun to come true in very limited scenarios. We're all understandably reluctant to give control of a speeding 2-ton object to a machine, for obvious reasons. The process has been very gradual, with a lot of oversight and regulation, as there should be. All I'm saying is that there should be an equal amount of oversight in other industries as well, particularly healthcare. Arbitrary benchmarks don't make me trust these systems more, regardless of who produces them.
>I'm attributing the start of the current AI hype cycle to the resurgence of CNNs using GPUs, roughly around the time of AlexNet and AlphaGo, not to LLMs and the Transformer architecture.
The current hype cycle, the one fueling hundreds of billions in investment by multiple of the biggest tech companies in the world has little to do with Alex Net and AlphaGo and everything to do with LLMs and generative AI in general.
>Though if we're being really pedantic, the original Transformer paper is from 2017, so almost a decade ago. But this is beside my point.
The Transformer paper did not start the generative AI craze. GPT-3 in 2020 did it for the research world, and the release of ChatGPT in November 2022 did it for the general public.
>The machine has no concept of a "domain". Whether it's outputting code, poetry, images, or video, it's all data generated by probabilistic pattern matching and pseudo-randomness. The structure and accuracy of the generated data is meaningful only to humans, and it's the most important factor that is measured in all these benchmarks.
What are you on about? Of course it does. These conversations are getting tiring. Yes, LLMs model concepts directly, independent of the text they are trained on. This has been demonstrated multiple times, including very recently again by Anthropic. There is nothing random about the predictions they make.
And even if the machine didn't model these things directly, the concept of domains would still be relevant to the humans testing it as 'data' is not equal.
SOTA LLMs are good for diagnosis. That was evident even before this benchmark. I'm not talking about some imagined future. I'm talking about right now. Sticking your head in the sand because of your 'experience' with coding is nonsensical. The benchmarks aren't arbitrary - In some cases they are directly testing the ability in question.
I'm not advocating for removing doctors from the picture entirely. It wouldn't even be possible even if I was, at least at the moment.
> IME the current SOTA models can't be relied on to accurately produce working code most of the time
My experience is the opposite.
Good lord. The idea that a sleep‑deprived, bias‑prone mammal can be relied on to give accurate health‑related information is deeply flawed and troubling. It’s one thing to seek a human for empathy, small talk, or a routine vaccine, but entirely different to assume an overworked doctor will consistently out‑diagnose a modern data tool. I shudder at the thought that the practitioner I visit might simply repeat whatever a drug‑rep lunch pitch or outdated guideline left in their memory.
This insanity needs to be regulated yesterday.
Thankfully, it is. :)
Are there regulations regarding doctor's working hours similar to say, truck drivers? I found this [1] article which states doctors are limited to 80 hours a week whereas truck drivers are limited to 70 hours in a 7 day period. [2]
[1] https://psnet.ahrq.gov/primer/duty-hours-and-patient-safety
[2] https://www.fmcsa.dot.gov/sites/fmcsa.dot.gov/files/docs/Dri...
Those restrictions are for residents (trainees).
They were also found not to improve patient outcomes (possibly due to increased number of handoffs, which are highly error prone).
Do you think it makes sense that doctors can work longer hours than truck drivers?
If the available evidence does not show an improvement in patient outcomes with shorter hours, and if longer hours are worked willingly and enable them to treat more patients (in the setting of widespread physician shortages), yes I do.
Turning the question around, if mandating shorter hours for physicians were found to result in higher rates of significant patient harm, would you be in favor of those restrictions?
Doctors are in high demand, and don't willingly work at places that demand long hours. Primary care doctors generally aren't residents.
I like that they include the "worst case score at k samples". This is a much more realistic view of what will happen, because someone will get that 1/100 response.
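I haven't checked exactly how OpenAI computes that metric, but a minimal sketch of the idea, assuming it's simply the minimum graded score across k sampled responses per prompt, averaged over prompts, might look like this (illustrative code, names are my own):

    import random
    from statistics import mean

    def worst_case_at_k(per_prompt_scores, k, trials=1000, seed=0):
        """Estimate the worst-case score at k samples.

        per_prompt_scores[i] holds the graded scores (0.0-1.0) of many sampled
        responses to prompt i. For each prompt, draw k of those scores without
        replacement, keep the minimum, then average the minima over prompts
        (repeated over several trials to smooth out sampling noise).
        """
        rng = random.Random(seed)
        trial_means = []
        for _ in range(trials):
            minima = [min(rng.sample(scores, k)) for scores in per_prompt_scores]
            trial_means.append(mean(minima))
        return mean(trial_means)

    # Toy example: a model that looks fine on average but gives one bad answer
    # in five. The plain average hides that; the worst-of-5 score doesn't.
    scores = [[0.9, 0.85, 0.95, 0.2, 0.9]] * 50
    print(worst_case_at_k(scores, k=1))  # ~0.76, roughly the plain average
    print(worst_case_at_k(scores, k=5))  # 0.2, dominated by the worst sample

That gap between the average and the worst-of-k number is exactly the "someone will get that 1/100 response" problem.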
Impressive how well Grok performs in these tests. Grok feels 'underrated' relative to how much other models (Gemini, Llama, etc.) are in the news.
I am amazed Gemini did as well as it appears.
Gemini frequently avoids discussing health problems, which likely hurt its scores. My guess is any censorship was considered a fail.
you can't download grok's weights to run locally
how is that relevant here?
it helps explain why there are fewer people talking about them than about Gemini or Llama?
fewer people using them.
You can't download Gemini's weights either, so it's not relevant as a comparison against Gemini.
I think the actually-relevant issue here is that until last month there wasn't API access for Grok 3, so no one could test or benchmark it, and you couldn't integrate it into tools that you might want to use it with. They only allowed Grok 2 in their API, and Grok 2 was a pretty bad model.
lol sorry, mixed them up with Gemma 3, which feels like the open, lesser cousin of the Gemini 2.5/2.0 models
I can guarantee you none of my friends (not in tech) use “downloading weights” as an input to select an LLM application.
isn't chatgpt the most used or most popular model?
Yes, OpenAI has a first-mover advantage, and Claude seems close behind as a second player with its closed models too. Open weights are not a requirement for success, but in an already crowded market (Grok's prospect), their proposition competes with neither the top-tier closed models nor the maybe lesser-capable but more available, battle-tested open ones that are free to run locally.
It's not.
Also, only one out of the ten models benchmarked have open weights, so I'm not sure what GP is arguing for.
> in terms of how much other models (gemini, llama, etc) are in the news.
not talking about TFA or benchmarks but the news coverage/user sentiment ...
Happy to see this. I've struggled with an injury for the past five years. I've been to multiple sports-focused physicians, had various scans. Responses from doctors have ranged from "everything seems fine, can't really figure this out" to [completely wrong hypothesis]. Tried acupuncture. Tried a chiropractor. I remember one doctor, though, had an interesting thought that seemed to make sense - but I've been so discouraged from so many false starts or misplaced hope, I didn't bother following up.
Finally I typed in my entire history into o3-deep-research and let it rip for a while. It came back with a theory for the injury that matched that one doctor, diagrams of muscle groups and even illustrations of proposed exercises. I'm not out of the woods yet, but I am cautiously optimistic for the first time in a long time.
Why would you trust a LLM over a battery of human experts? I find it hard to believe that the doctors never proposed exercises or some kind of physical therapy for you, at least in the US.
I've never used LLMs for this, but as someone who's been through a lot of sports-related injuries, I find doctors more or less useless (except for prescribing painkillers and performing surgeries.)
No doctor or physio has ever been able to fix my chronic issues, and I've always had to figure them out myself through lots of self-study and experimentation.
I can't speak to the OP's condition, but having seen plenty of doctors and physical therapists in the US for over a decade:
Yes, they propose exercises.
No, they don't work.
For certain (common) conditions, PT seems to have it nailed - the exercises really help. For the others, it's just snake oil. Not backed by much research. The current state of the art is just not good when it comes to chronic pain.
So while I don't know if an LLM can be better than a battery of human experts, I do know that those human experts do not perform well. I'm guessing with the OP's case, that battery of human experts does not lead to a consensus - you just end up with 10 different treatments/diagnoses (and occasionally, one is a lot more common than the other, but it's still wrong).
Much of modern medicine's expertise is focused on areas that reimburse well (in some part because those industries subsidize the research).
I would not expect most physicians to have a deep fund of literature-backed knowledge to draw from regarding exercise. Telling someone to do an exercise probably doesn't compensate well.
That said, I'm also pretty negative about the availability of rigorous literature regarding much of nutrition, dentistry, podiatry, physical therapy, etc... you know, the things that affect the health of most human beings that have ever lived.
Sports medicine specialists do this all day every day. They don't hesitate to prescribe specific exercises if they think it will help, and as for compensation they get paid the same for that service as any other regular office visit (typically CPT code 99212 or similar).
Because there is so much variability in individual injuries and physiology it's extremely difficult to do rigorous studies comparing different treatments. Like even something common like a rotator cuff tear isn't one single thing that can always be treated the same way. Patients and practitioners will often have to follow a trial-and-error process until they figure out what works in a particular case. Experienced providers who see a lot of cases eventually develop a lot of tacit knowledge about this in a way that's difficult to codify or explain.
Do you therefore agree that most physicians do not have a "deep fund of literature-backed knowledge to draw from regarding exercise"?
> Because there is so much variability in individual injuries and physiology it's extremely difficult to do rigorous studies comparing different treatments.
This is generally true of research in all areas of medicine; I don't see it as a fundamental reason why we wouldn't hope for equally powered and rigorous studies focusing on the issues that affect most of the population, even if there are not pharmaceutical or device companies that will profit from the recommended interventions.
I'm only a decade in, but my clinical experiences in my area of expertise vary widely from those of my colleagues, which vary from those of other colleagues, in spite of years of shared experience in a fairly homogenous demographic. Expert opinion is not a level of evidence that carries much weight in my mind.
It may be the best we have in many cases, I just wish we had something better.
You have to be realistic and assume that treating most major musculoskeletal injuries will be a lengthy trial-and-error process. Every injury is unique and imaging studies or other tests can only give limited clues about what's going on. There are patients in terrible pain whose scans look normal, and there are other patients whose scans indicate serious pathology yet they are pain-free and completely functional. Instead of getting discouraged, keep trying different modalities (except for the pseudo-science BS) until something works.
In general a lot of those injuries will eventually heal on their own. So it's easy to fool yourself into believing that a particular treatment was effective even when the real cure was time.
Sounds like you needed a good PT, not a physician.
Good DPT with a team of PTs and PTAs, and possibly also see a DO who can do osteopathic manipulation therapy (OMT).
Start with a dedicated orthopaedic practice who will make referrals to the above and then order CATs/MRIs if the above isn’t helping and evaluate if surgery is necessary.
One challenge is that it is hard to even know how to navigate all this. LLMs could actually help with that.
I mean, it sounds like that doctor had it right then? I don't think this reads as a win for "online self-diagnosis", AI-empowered or otherwise.
From their story, it sounds like the one doctor had it right because the AI-empowered self diagnosis seems to agree. Or do I misunderstand?
You trusted known grifters (acupuncturists and chiropractors), but not doctors. Then you trusted a computer known for telling lies.
I think you should take a step back and re-assess your internal heuristics.
They literally said they saw sport focused physicians.
I hope recent cuts to government science have managed to hit enough of the safetyists and industry captures who keep us from just trying out new healthcare approaches like this and learning. They'd like nothing better than to replace the help you got with "As a large language model, I am unable to offer medical advice."
This appears to be a very thoughtful and helpful study. It's also impressive to see the improvement in performance in just the last year of model development - almost double.
I've found o3 & deep research to be very effective in guiding my health plan. One interesting anecdote - I got hit in the chest (right over the heart) quite hard a month or so ago. I prompted o3 with my ensuing symptoms and heart rate / oxygenation data from my Apple watch, and it already knew my health history from previous conversations. It gave very good advice and properly diagnosed me with a costochondral sprain. It gave me a timeline to expect (which ended up being 100% accurate) and treatments / ointments to help.
IMO - it's a good idea to have a detailed prompt ready to go with your health history, height/weight, medications and supplements, etc. if anything's happening to you you've got it handy to give to o3 to help in a diagnosis.
SpO2 data from an Apple Watch (or any wrist device) is highly suspect and not reliable as a diagnostic tool. Under ideal conditions it can work reasonably well but it can also be thrown way off by multiple factors like wristband tightness, skin color, patient movements, dirt, etc. A cheap and simple dedicated pulse oximeter that clips on to your finger will generally be much more accurate and consistent.
How confident was it in that diagnosis? Did you trust its confidence if so? I asked this to my medico wife and apparently the other possibility is pericardial effusion from injury which can be life threatening.
Definitely - o3 proposed a few things initially: Costochondral separation, Intercostal strain, rib fracture, pericardial effusion and a few other things...
After answering questions about symptoms, it gave me a few movements/actions to do and to rate the pain of each. Based on that it diagnosed me and gave me a timeline of what to expect and things to watch out for.
Really it was as helpful as a doctor could have been in that scenario, IMO.
My sense is that these benchmarks are not realistic in terms of the way the model is used. People building specialized AI systems are not, in my experience, letting users just chat with a base model, they would have some variant of RAG plus some guardrails plus other stuff (like routing to pre-written answers for common question).
So what use case does this test setup reflect? Is there a relevant commercial use case here?
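To make that concrete, here's a hypothetical sketch of the sort of wrapper I'm describing (every name, threshold, and canned answer is invented for illustration, not any real product's API): canned-answer routing, a crude guardrail, and retrieval before the base model ever sees the question.

    # Hypothetical wrapped medical Q&A pipeline; purely illustrative.

    CANNED_ANSWERS = {
        "what is a normal resting heart rate":
            "For most adults, roughly 60-100 beats per minute at rest.",
    }

    RED_FLAGS = ["chest pain", "shortness of breath", "suicidal"]

    def answer(question, retrieve, call_llm):
        q = question.strip().lower().rstrip("?")

        # 1. Route common questions to pre-written, clinician-reviewed answers.
        if q in CANNED_ANSWERS:
            return CANNED_ANSWERS[q]

        # 2. Guardrail: escalate possible emergencies instead of answering.
        if any(flag in q for flag in RED_FLAGS):
            return "This could be an emergency. Please contact emergency services."

        # 3. RAG: ground the base model in retrieved, vetted guideline snippets.
        snippets = retrieve(question)  # e.g. vector search over curated guidelines
        prompt = (
            "Answer using only the sources below; say you don't know otherwise.\n\n"
            + "\n\n".join(snippets)
            + f"\n\nQuestion: {question}"
        )
        return call_llm(prompt)

    # Usage with stand-in callables:
    print(answer("What is a normal resting heart rate?",
                 retrieve=lambda q: ["(guideline snippet)"],
                 call_llm=lambda p: "(model answer)"))

Benchmarking the bare model skips all three of those layers, which is the gap I'm asking about.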
You could say the same for code benchmarks, no?
For general medical Q&A I can't see how a specialized system would be better than base o3 with web search and a good prompt. If anything RAG and guardrail prompts would degrade performance.
It tests people chatting to ChatGPT! That's a pretty big and important use case.
Non-clinicians are using ChatGPT every day now to try to find assistance (right or wrong) to real-life medical problems. This is a great evaluation set that could prevent a lot of harm
Recently I uploaded a lab report to chatGPT and asked it to summarize it.
It hallucinated serious cancer, along with all the associated details you’d normally find on a lab report. It had an answer to every question I had pre-asked about the report.
The report said the opposite: no cancer detected.
Interesting. What LLM model? 4o, o3, 3.5? I had horrible performance with earlier models, but o3 has helped me with health stuff (hearing issues).
Whichever the default free model is right now- I stopped paying for it when Gemini 2.5 came out in Google's AI lab.
4o, o4? I'm certain it wasn't 3.5
Edit: while logged in
> Whichever the default free model is right now
Sigh. This is a point in favor of not allowing free access to ChatGPT at all given that people are getting mad at GPT-4o-mini which is complete garbage for anything remotely complex... and garbage for most other things, too.
Just give 5 free queries of 4o/o3 or whatever and call it good.
If you're logged in, 4o; if you're not logged in, 4o-mini. Neither scores well on the benchmark!
This gets at the UX issue with AI right now. How's a normie supposed to know and understand this nuance?
Or a non-normie. Even while logged in, I had no idea what ChatGPT model it was using, since it doesn't label it. All the label says is "great for everyday tasks".
And as a non-normie, I obviously didn't take its analysis seriously, and compared it to Grok and Gemini 2.5. The latter was the best.
Added context: While logged in
Might be worth trying again with Gemini 2.5. The reasoning models like that one are much better at health questions.
Gemini 2.5 in AI Studio gave by far the best analysis
I can’t believe you’re getting downvoted for answering the question about the next-token-predictor model you can’t recall using.
What is happening?
I wonder if it was unable to read your report, and just answered as if role-playing?
I gave it a pdf of an engine manual recently and asked some questions, which it answered reasonably. It even pulled a schematic out for me, though it was the wrong one (it gave me a schematic for the CDI ignition variant that we first talked about, rather than the DSAI one we settled on later.)
>I wonder if it was unable to read your report, and just answered as if role-playing?
FWIW,
This is what ChatGPT gave me when I uploaded a lab report; it read the results back to me:
Here’s an interpretation of the blood test results you provided:
White Blood Cell Count (WBC) Value: 9.2 (Normal: 3.8 - 10.8 Thousand/uL)
Within the normal range, indicating no immediate signs of infection or inflammation.
Red Blood Cell Count (RBC) Value: 5.29 (Normal: 3.8 - 5.1 Million/uL)
Elevated (High). This could indicate dehydration, stress, or other conditions like polycythemia.
Hemoglobin Value: 10.9 (Normal: 11.7 - 15.5 g/dL)
Low. Suggests anemia, possibly due to iron deficiency or chronic disease.
Hematocrit Value: 36.0% (Normal: 35.0 - 45.0%)
Within the normal range, but on the lower side, supporting the presence of mild anemia.
Mean Corpuscular Volume (MCV) Value: 68.1 (Normal: 80.0 - 100.0 fL)
Low. Indicates microcytic anemia, often due to iron deficiency.
Mean Corpuscular Hemoglobin (MCH) Value: 20.6 (Normal: 27.0 - 33.0 pg)
Low. Suggests hypochromia (paler red blood cells), often associated with iron-deficiency anemia.
Mean Corpuscular Hemoglobin Concentration (MCHC) Value: 30.3 (Normal: 32.0 - 36.0 g/dL)
Slightly low. May reflect reduced hemoglobin in red blood cells, consistent with iron-deficiency anemia.
Red Cell Distribution Width (RDW) Value: 15.8% (Normal: 11.0 - 15.0%)
Elevated. Indicates variability in red blood cell size, common in iron-deficiency anemia.
Platelet Count Value: 397 (Normal: 140 - 400 Thousand/uL)
Normal, but on the higher side of the range.
Other Key Metrics Absolute Neutrophils, Lymphocytes, Monocytes, Eosinophils, Basophils: All within normal ranges, suggesting no significant abnormalities in immune cell populations.
Anisocytosis (1+): Indicates variation in red blood cell size.
Hypochromasia (1+): Suggests pale red blood cells, supporting iron-deficiency anemia.
Summary These results suggest iron-deficiency anemia, characterized by low hemoglobin, low MCV, low MCH, and high RDW. The elevated red blood cell count may reflect compensatory mechanisms or mild dehydration.
Your provider recommends a heart-healthy, low-fat diet, which is great for overall health. However, addressing the anemia might also involve increasing dietary iron or taking iron supplements.
The diagnosis is wrong, btw, I don't have iron deficiency. The anemia is caused by a genetic condition called thalassemia, which has been verified by genetic tests. You can use the Mentzer Index to differentiate the two on a simple CBC - https://www.mdcalc.com/calc/10534/mentzer-index
My numbers return a "probable diagnosis."
I was wondering if chatgpt would catch it, nope, it didn't. It did say that it was a possibility once I suggested it though.
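For anyone curious, the Mentzer Index from that link is just MCV divided by the RBC count, with ~13 as the cutoff between the two. A small sketch (illustrative code, not medical advice) plugging in the numbers from the report above:

    def mentzer_index(mcv_fl, rbc_millions_per_ul):
        """MCV (fL) divided by RBC (million/uL). Below ~13 leans toward
        thalassemia trait, above ~13 toward iron-deficiency anemia.
        A screening heuristic only, not a diagnosis."""
        index = mcv_fl / rbc_millions_per_ul
        leaning = "thalassemia trait" if index < 13 else "iron deficiency"
        return index, leaning

    # Values from the CBC above: MCV 68.1 fL, RBC 5.29 million/uL
    index, leaning = mentzer_index(68.1, 5.29)
    print(f"Mentzer index {index:.1f} -> leans {leaning}")  # ~12.9 -> thalassemia trait

So even a one-line heuristic applied to the same CBC points away from the iron-deficiency story the model confidently told.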
It's also disappointing that the LLM would falsely claim that a low-fat diet is "great for overall health". That claim has been thoroughly debunked in the scientific literature (or at least not supported by evidence-based medicine), and yet the recommendation still keeps popping up. While there are certain specific types of fats that can be problematic, in general a moderate amount of fat is fine as a macronutrient.
LLMs don't have a magical ability to arbitrate complex truths. If the "low-fat diet" narrative is prominent in the training literature, I would expect the model to say that more often than it says the opposite.
I don't ever really use the term "thoroughly debunked" when referring to nutrition science; as you noted, a better term is that the claim is not supported by the evidence. I've seen enough things debunked and then rebunked to know that nutrition science is not really that accurate.
Interesting! Well, that's disappointing.
No, cancer detected!
You're not on Reddit.
I typically respect the decorum here, but I couldn't help myself this time.
What was the outcome (did the subject of the lab report go on to present cancer)? (sometimes what we think is a false positive ends up being a true positive later, and cancer is a late-arriving signal).
The lab report clearly described in medical jargon what the sample was, and that the sample was not cancer.
There was no reference in the report that the LLM might have pulled out to think otherwise.
I dunno, cancer is complicated. I'd have to see the details to make that judgement. It really depends on a lot of factors- what the nature of the sample was, what the lab report did, etc.
I walked away from this with a feeling I can't quite put into words. I'm not a doctor, and I don’t really understand medical AI, but a friend of mine who is a doctor has been relying more and more on ChatGPT lately—to look up guidelines, organize his thoughts. He says it’s not that it knows more than he does, but that it’s fast, clear, and saves him time. That got me thinking. I used to assume AI in healthcare was about replacing people. Now, it feels more like an extension. Like doctors are getting an extra pair of hands—or a second consultation room that's always online. Maybe that’s what progress looks like: human judgment is still there, but increasingly shaped by structured input. I don’t know if that’s good or bad. It just feels complicated.
The problem is the healthcare industry will leverage this to place more responsibilities in the hands of increasingly less qualified individuals - as they're already doing with nurses. Costs will never go down, but the salary of healthcare professionals and quality of care will.
AI is a tool to transfer wealth from the middle class to the ultra wealthy. If there is a brief intermediary period where quality of life improves for the poor, it's a temporary coincidence that hasn't been optimized away yet.
Your. Costs. Will. Never. Go. Down. While. Maintaining. Equal. Service. In. The. Long. Term. That's. Not. What. Capitalists. Want.
Exactly correct. So eventually we’ll pay a nurse’s aide to type prompts into GPT the same price for a visit where we used to see an MD (and then an NP, and then a PA, and then a BSN, and then an RN…), and get charged the same.
That’s a valid concern, especially in a system so focused on cutting costs. What felt a bit different here is that the doctors helped define the evaluation standards. If AI gets built with them instead of for them, maybe there’s still some hope it won’t just be about saving money.
the amount of obvious astroturfing by people with investments in LLMs on this website is insane.
you only need to look at this thread alone to prove your point. this is a race to the bottom that we're being accelerated towards by some of the worst people on earth.
If your condition can easily be resolved by waiting a little and letting your body recover, an honest doctor will tell you so. I wonder if an AI will ever risk not recommending you to see a doctor.
An AI would absolutely tell you not to go to a doctor unless it’s got a prompt trying to keep it from saying that.
No, they will not, actually. At least in the lower-middle, Medicaid spectrum of care.
Physicians at this level serve the labor market, and everything they do is ordered to keeping you able-bodied and working and on the job and not taking sick days.
They have always been reluctant to dispense excuses from work, to document any sort of disability related trouble, or to give advice that does not involve more exercise or more medication.
I have honestly approached multiple doctors in multiple health systems to complain about fatigue, syncope, and exhaustion, and they all go to extraordinary lengths to deny that anything is wrong or that anything should change.
I use Medicaid and have never found that to be the case.
In fact, I’d prefer to be back on full work duties, but I’m still on light duty until my next appointment, assuming he restores me to full duty.
Providers who are reimbursed via Medicaid have zero incentive specific to Medicaid to not give out work “excuses”, document disabilities and so on. Generally speaking the same providers serve both Medicaid and non-Medicaid patients.
Perhaps there are some states where the Medicaid providers are really awful (California?).
Which raises an important question: to what extent is American healthcare built around liability?
There are a lot of innocuous things built into the healthcare system, like gating the cough syrup that easily gets you really high behind a visit to a provider, that many people are just ignorant of by virtue of the current setup.
When we disrupt fields that we are not experts in we often miss oceans of nuance.
I’ve used GPT for personal health research—nothing serious, just to help clarify questions before a doctor visit. It’s useful, but honestly? Also unsettling. The authority it projects is easy to believe, even when it’s just confidently wrong. HealthBench seems like a small step toward grounding that power in something real. It doesn't fix the problem, but at least it admits it exists.
I would really rather like a benchmark purely focusing on diagnosis. Symptoms, patient history vs the real diagnosis. Maybe name this model House M.D 1.0 or something.
The other stuff is good to have, but ultimately a model that focuses on diagnosing medical conditions is going to be the most useful. Look - we aren't going to replace doctors anytime soon, but it is good to have a second opinion from an LLM purely for diagnosis. I would hope it captures patterns that weren't observed before. This is exactly the sort of game that AI can beat a human at - large-scale pattern recognition.
Is the Med-PaLM model that Google has been working on meant to be considered for comparison? If I'm not mistaken, it isn't publicly available.
> https://sites.research.google/med-palm/
Med-PaLM is old and has been superseded by (multiple generations of) Gemini.
Why are all the label colours for the "Worst-case HealthBench score at k samples" chart the same colour and the same shape? Completely unreadable.
The colors are slightly different. I think the person who made this was more concerned with appearance than with providing information.
Most probably I’m going to get downvoted, but I’m gonna say it…
It’s a pity they don’t support Greek language, keeping in mind that almost all medical terminology has Greek origins.
Anyhow, this is a step in the good direction and for sure it will aid many people looking for medical assistance via ChatGPT.
I have zero trust in OpenAI's ability to do anything impartially. Why should we leave the judgment of a private tool up to the makers of the tool, especially when human lives are at stake?
I can already see the pharma salesmen drooling at the idea of how various symptoms can be marketed to.
"As an AI agent, a possible diagnosis is [xxx]. Ask your doctor about / look into [yyy™] for a possible solution!"
And OpenAI is definitely thinking about this on their end:
https://www.axios.com/2024/12/03/openai-ads-chatgpt
> OpenAI CFO Sarah Friar told the Financial Times that OpenAI is weighing the inclusion of ads in its products but wants to be "thoughtful about when and where we implement them."
Ad spots inside chatgpt are going to be worth an obscene amount of money.
I agree - we should exercise a bit of caution here. There is no way they would release a benchmark which makes their model look bad. But then again, we know that their models are among the best for other uses, so it's not a big leap to accept this benchmark.
I think that the damage from a "ChatGPT misdiagnoses X as Y, person dies of Z" story would be quite bad for PR.
Sam Altman does not care about “improving human health”
I don't want to be a conspiracy theorist, but could this be in preparation for Amazon's (to be) health branch?