AI in Medicine: The Importance of Prospective Studies and the Disparity between Retrospective and Prospective Results
When the Go champion Lee Sedol was defeated by AlphaGo in 2016, many people dreamed that artificial intelligence would replace experts in medicine and in every other field. The achievements of AI in medicine have been recorded one by one by IEEE Spectrum (“AI versus Doctor”; https://ieeexplore.ieee.org/document/8048826).
Since then, however, the main focus has shifted to AI as an assistive tool for experts rather than a replacement. In addition, it is not uncommon to hear that AI businesses are not profitable; even IBM’s Watson Health was sold off amid criticism. Something seems to have gone wrong somewhere, so why do we keep hearing such news?
In 2018, our group published a study showing that our AI outperformed dermatologists in diagnosing onychomycosis. At the time, an article (https://spectrum.ieee.org/ai-beats-dermatologists-in-diagnosing-nail-fungus) was also published in IEEE Spectrum. Its title, “AI Beats Dermatologists in Diagnosing Nail Fungus,” captures the atmosphere of the period, when AI seemed likely to dominate every field.
A reader test (in effect, a quiz) was conducted in which participants examined nail photographs and predicted whether each image showed onychomycosis. As the graph above shows, none of the 40 participating dermatologists performed better than the AI. This result was shocking because the participants included Korea’s leading dermatology professors specializing in onychomycosis. At that point we thought AI would soon conquer medicine, but it still has not.
This overwhelming result came from a “quiz” that tested whether photographs showed onychomycosis or not. For AI to be used in the real world, its results must be reproducible in a real-world setting. To confirm the performance, we conducted a prospective study with the same onychomycosis algorithm in 2019 (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234334) and obtained the following results.
Roughly speaking, the AI performed at the level of the participating specialists. However, the specialists this time were not onychomycosis experts but clinical instructors in dermatology. In the earlier retrospective study, the same algorithm had overwhelmed the most experienced onychomycosis specialists, yet its actual performance in the prospective study was only on par with that of general dermatologists.
The problem is that onychomycosis image analysis is very advantageous for AI. The composition of nail-plate photographs is easy to standardize, and the task requires only a binary decision (onychomycosis or not; binary classification) rather than a choice among many diagnoses (multi-class classification). Moreover, the model was trained on about 50,000 images that had been reviewed and revised several times for this binary classification. It was difficult to improve performance significantly by using a more cutting-edge model or adding more clinical images. After this result, I concluded that achieving satisfactory AI performance in a prospective medical setting would not be easy.
In 2018, we published a study showing that an algorithm could classify 12 skin tumors at the dermatologist level (http://www.jidonline.org/article/S0022-202X(18)30111-8/fulltext). However, a rebuttal letter was published soon afterward, stating that the model’s performance was not at that level (https://www.jidonline.org/article/S0022-202X(18)31991-2/fulltext). As this controversy shows, AI is good at solving quizzes on paper, but when other people use the same demo in a real-world setting, its performance falls far below expectations. If so, why does this disparity arise, and how large is it?
The result above (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003381) shows why an AI that excels at quizzes is not so impressive in the real world. In short, doctors have much better diagnostic ability in actual practice than in a quiz show. In the graph above, at the same specificity, the attending dermatologists who performed in-person examinations showed almost 25% higher sensitivity than the dermatologists who diagnosed from photographs alone. In the pilot test of this study, the diagnosis rate of the clinical instructors who took the quiz was so low that I suspected they had not answered diligently, and I made them review the same questions several times. In fact, the diagnosis rate of dermatologists in reader tests is simply much lower than we had expected. Dermatology is not a simple inspection of visual findings.
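Comparing sensitivity “at the same specificity” means fixing a decision threshold that achieves the target specificity and then reading off the sensitivity there. A minimal sketch of that calculation, using made-up scores and labels (not data from any of the studies above):

```python
# Illustrative sketch (not the study's actual code): find the most lenient
# threshold that still meets a target specificity, then report sensitivity there.
def sensitivity_at_specificity(scores, labels, target_specificity):
    """scores: predicted probability of disease; labels: 1 = diseased, 0 = healthy."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    best = (0.0, None)
    # Scan thresholds from strictest (highest) to most lenient (lowest);
    # the last threshold meeting the specificity target maximizes sensitivity.
    for t in sorted(set(scores), reverse=True):
        specificity = sum(s < t for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s >= t for s in positives) / len(positives)
            best = (sensitivity, t)
    return best

# Toy data: 4 diseased and 4 healthy cases with hypothetical scores.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
sens, thr = sensitivity_at_specificity(scores, labels, 0.75)
# Here sens = 1.0 at threshold 0.4 while keeping specificity at 0.75.
```

In practice one would use `sklearn.metrics.roc_curve` for this; the point is only that a sensitivity gap at matched specificity reflects a genuinely better operating point, not a different threshold choice.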
Retrospective results in medical AI research may therefore be exaggerated. In fact, a review (https://www.thelancet.com/journals/landig/article/PIIS2589-7500(19)30123-2/fulltext) pointed out that only about 20 prospective studies existed among roughly 20,000 AI studies.
So, what is the actual performance of AI in dermatology? To investigate, we recently conducted a prospective randomized controlled trial (RCT) (https://www.jidonline.org/article/S0022-202X(22)00122-1/fulltext). For this trial, we used an algorithm with top-class retrospective performance on (a) multi-class classification of 134 diseases and (b) binary classification of whether a lesion is cancer or not.
In this study, the actual Top-1 accuracy (the probability that the first-listed diagnosis is correct) was at the level of a first-year dermatology resident. Fortunately, the Top-3 accuracy (the probability that the correct answer is among the three listed diagnoses) was at the level of attending dermatologists. Considering that the retrospective performance of this algorithm had been at the level of an experienced top-class dermatologist, this result is again a step down.
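The Top-1/Top-3 distinction is simply a matter of how deep into the ranked differential diagnosis we look for the correct answer. A minimal sketch with made-up diagnoses and probabilities (not the trial’s data):

```python
# Illustrative sketch: Top-k accuracy for a multi-class classifier.
def top_k_accuracy(predictions, true_labels, k):
    """Fraction of cases whose true label is among the k highest-scored classes."""
    hits = 0
    for probs, truth in zip(predictions, true_labels):
        # Classes ranked by predicted probability, highest first.
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Three hypothetical cases; each dict maps a diagnosis to its predicted probability.
predictions = [
    {"melanoma": 0.5, "nevus": 0.3, "wart": 0.2},
    {"nevus": 0.6, "melanoma": 0.25, "wart": 0.15},
    {"wart": 0.7, "nevus": 0.2, "melanoma": 0.1},
]
truths = ["melanoma", "melanoma", "nevus"]

top1 = top_k_accuracy(predictions, truths, k=1)  # only the first case is correct
top3 = top_k_accuracy(predictions, truths, k=3)  # the truth is always within the top 3
```

As in the trial, Top-3 accuracy can be substantially higher than Top-1: a model whose first guess is often wrong may still reliably include the correct diagnosis in a short differential list.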
In this study, we investigated not only the standalone performance of the AI but also whether it could augment doctors’ diagnostic ability. As a result, the accuracy of the general physicians with the least experience improved the most, whereas the accuracy of the dermatology residents did not improve. Because the AI’s Top-3 diagnostic ability was at the level of attending dermatologists and superior to that of first-year residents, the AI could improve the Top-3 accuracy of first-year dermatology residents. In other words, the AI helped only when it outperformed its users in the prospective setting. Since the AI’s competitiveness lay in Top-3 accuracy, it may be better suited to suggesting a range of possible diseases than to pinpointing a single diagnosis.
There are various reasons why AIs perform well in retrospective studies but not in prospective settings: (a) in-person examination is much more accurate; (b) benchmark results can be inflated by shortcut learning (Clever Hans-type bias); (c) the out-of-distribution (OOD) problem, in which an AI cannot solve problems it was never trained on (https://jamanetwork.com/journals/jamadermatology/article-abstract/2784298).
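The OOD problem arises because a closed-set classifier must assign every input to one of its trained classes, so an untrained disease still receives a confident-looking label. One common baseline mitigation (a heuristic, not the method of the cited paper) is to flag inputs whose maximum softmax probability is low. A minimal sketch with made-up class names and logits:

```python
# Illustrative sketch: max-softmax confidence as a crude OOD flag.
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_ood_flag(logits, class_names, threshold=0.5):
    """Return (predicted class, flagged-as-possibly-OOD?)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best] < threshold

classes = ["onychomycosis", "normal nail"]  # hypothetical trained classes
# An untrained condition (e.g., a nail tumor) may yield ambiguous logits,
# yet the model is still forced to pick a trained label:
label, is_ood = predict_with_ood_flag([0.2, 0.1], classes, threshold=0.6)
# label is "onychomycosis", but the low confidence triggers the OOD flag.
```

This baseline is known to be imperfect, since OOD inputs can also produce overconfident predictions, which is precisely why prospective validation is needed.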
AI exists for prediction. In traditional statistics, we can extract insight even from biased retrospective data, but such insight does not carry over to AI research: an AI must be able to predict prospectively. Because of the discrepancy between prospective and retrospective performance in medical AI research, it is difficult to trust results derived from retrospective data, especially for problems with many variables.
To overcome these problems, first, it is important to narrow down and select problems on which AI can work well. Onychomycosis is a minor medical problem compared with life-threatening disorders, but we chose it because the background is constant, the composition is easy to standardize, there is no racial difference, and a large number of nail-plate images can be obtained. Similarly, problems with a unified composition (lip diseases; https://onlinelibrary.wiley.com/doi/abs/10.1111/bjd.19069) or whose dimensionality can be reduced by narrowing the conditions or anatomical sites, such as sexually transmitted diseases (https://mhealth.jmir.org/2020/11/e16517), are good approaches to overcoming the current performance issues.
Second, we must handle data well using domain knowledge. A data scientist should not be a mere end-user of TensorFlow, PyTorch, and Python. Like an orchestra conductor, the data scientist should supervise the generation of consistent data. Recently, efforts to improve AI performance by generating synthetic data, as at NVIDIA (https://blogs.nvidia.com/blog/2021/06/08/what-is-synthetic-data/), or by revising data (Andrew Ng’s data-centric AI; https://spectrum.ieee.org/andrew-ng-data-centric-ai) have been frequently discussed. Unless something genuinely new emerges, it may not be easy to improve performance by purely technical means (e.g., the choice of model or training hyperparameters).
Third, we need to be good at designing prospective studies and to accumulate experience with them. Unlike retrospective studies, prospective studies involve many unexpected variables, and a few mistakes can render the results unreliable. AI exists to predict the future. Even on a small scale, it is important to repeat prospective studies, recognizing and correcting problems while gaining experience in study design. By repeating prospective studies, the diagnostic characteristics of an AI can be identified.
In summary, it does not matter how well an AI performs in a quiz show; what matters is that it works in the real-world setting. Furthermore, AI should be able to change the decisions of doctors or patients, and in the future, such changes should lead to improvements in survival rates and medical costs. However, because the discrepancy between prospective and retrospective results is large, we need to narrow the scope of the problems we tackle and put substantial effort into improving data with domain knowledge.
REFERENCES: OUR WORKS
- Assessment of Deep Neural Networks for the Diagnosis of Benign and Malignant Skin Neoplasms in Comparison with Dermatologists: A Retrospective Validation Study. PLOS Medicine. 2020
- Performance of a Deep Neural Network in Teledermatology: A Single-Center Prospective Diagnostic Study. J Eur Acad Dermatol Venereol. 2020
- Keratinocytic Skin Cancer Detection on the Face Using Region-Based Convolutional Neural Network. JAMA Dermatol. 2019
- Seems to Be Low, but Is It Really Poor? Need for Cohort and Comparative Studies to Clarify the Performance of Deep Neural Networks. J Invest Dermatol. 2020
- Multiclass Artificial Intelligence in Dermatology: Progress but Still Room for Improvement. J Invest Dermatol. 2020
- Augmented Intelligence Dermatology: Deep Neural Networks Empower Medical Professionals in Diagnosing Skin Cancer and Predicting Treatment Options for 134 Skin Disorders. J Invest Dermatol. 2020
- Interpretation of the Outputs of a Deep Learning Model Trained with a Skin Cancer Dataset. J Invest Dermatol. 2018
- Automated Dermatological Diagnosis: Hype or Reality? J Invest Dermatol. 2018
- Classification of the Clinical Images for Benign and Malignant Cutaneous Tumors Using a Deep Learning Algorithm. J Invest Dermatol. 2018
- Augmenting the Accuracy of Trainee Doctors in Diagnosing Skin Lesions Suspected of Skin Neoplasms in a Real-World Setting: A Prospective Controlled Before-and-After Study. PLOS ONE. 2022
- Evaluation of Artificial Intelligence-Assisted Diagnosis of Skin Neoplasms: A Single-Center, Paralleled, Unmasked, Randomized Controlled Trial. J Invest Dermatol. 2022