ChatGPT for urodynamic quality: what can it do?
Introduction
Large language models (LLMs) are a type of Artificial Intelligence (AI) designed to mimic human language processing abilities; they employ large amounts of data and advanced techniques to generate meaningful responses to human prompts. Generative Pre-trained Transformer (GPT) is a family of LLMs first released by OpenAI in 2018. GPT-3.5, released in 2022 and trained on a massive dataset of text, attracted more than 100 million users through ChatGPT. On March 15, 2023, OpenAI officially launched GPT-4.0, a multimodal large model that supports image and text input as well as text output, with a strong ability to recognize pictures.
ChatGPT is an AI-driven natural language processing tool that generates responses based on patterns and statistical regularities observed during the pre-training phase. It interacts with the chat context and performs tasks such as composing emails, translating, coding, and writing papers. Owing to these advantages, ChatGPT is gaining attention in the medical field, where researchers believe it could be beneficial. One study demonstrated performance comparable to that of medical students on National Board of Medical Examiners question sets. In another study, ChatGPT was utilized to assess the quality of radiology reports and to simplify them.
The urodynamic study (UDS) is widely employed as a crucial functional evaluation approach in the field of Lower Urinary Tract (LUT) disorders. It plays a significant role in clinical diagnosis, follow-up, and scientific research related to urological diseases. Ensuring the quality of UDS is paramount, as only a meticulously controlled procedure can yield reliable results. However, there is a dearth of novel methods for analyzing UDS quality. Moreover, interpreting UDS reports poses challenges, involving different personnel across various centers, including urologists, urodynamic technicians, and nurses. This discrepancy in interpreters can result in substantial variations in the quality of the reports.
The progress and potential of Machine Learning (ML) in urodynamic data were discussed during the International Consultation on Incontinence-Research Society meeting held in Bristol, United Kingdom, in June 2023. Research indicates that ML is well suited to analyzing urodynamic data; however, current results have not yet demonstrated clinical utility. That report represents the first attempt to explore the application of ML in the field of urodynamics, but it provides only an overview of ML's potential use and does not delve into specific clinical applications in detail.
To validate the feasibility of using ChatGPT in UDS, we designed this study to evaluate the application of ChatGPT in urodynamic quality control, specifically its ability to answer conceptual and non-conceptual questions and to analyze traces.
Materials and methods
Characteristics of the ChatGPT system
In this study, we utilized two versions of ChatGPT (OpenAI, San Francisco, California): ChatGPT 3.5 and ChatGPT 4.0. ChatGPT 4.0 incorporates image analysis and processing capabilities, extending its range of functionalities.
Designing and administering questions for ChatGPT (excluding image analysis)
The study utilized a standardized prompt, “Can you provide an answer to the following question…”. Questions were divided into two categories: conceptual questions, defined as those that do not require additional logical analysis, and non-conceptual questions, defined as those that necessitate further logical processing and analysis. Conceptual questions were posed first, followed by the non-conceptual questions (Tables 1, 2).
Table 1: ChatGPT’s responses to conceptual questions related to urodynamic quality control.
Table 2: ChatGPT’s responses to non-conceptual questions related to urodynamic quality control.
Designing and administering questions for ChatGPT (image analysis)
Questions for report interpretation were mainly submitted to ChatGPT 4.0 by uploading images. ChatGPT 4.0 can receive images from users, break them down into pixels, extract features, and use a trained neural network to recognize and comprehend image content. Typical UDS traces, encompassing uroflow study traces and pressure-flow study traces, were uploaded, covering common diseases such as Benign Prostatic Hyperplasia (BPH), Stress Urinary Incontinence (SUI), Underactive Bladder (UAB), and Neurogenic Bladder (NB). Instructions were then provided for the interpretation of reports or quality analysis, to assess ChatGPT’s capacity to analyze and respond to medical images.
Criteria for evaluating ChatGPT’s output
After ChatGPT provides an answer, the corresponding response is evaluated based on the “Good Urodynamic Practice (GUP)” guideline and relevant published literature for accuracy. The interpretation and quality analysis results of the traces undergo assessment by two independent urologists. The evaluation process categorizes ChatGPT’s output as “correct” if keywords align with those in the references, demonstrating consistency. Conversely, if there is inconsistency with the references or if the answer is incomplete, it is categorized as “incorrect.”
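The keyword-alignment rule described above can be illustrated with a small sketch. This is a hypothetical helper for clarity only; in the study the categorization was performed manually by two independent urologists against the GUP guideline and published literature:

```python
def categorize_response(response: str, reference_keywords: list[str]) -> str:
    """Label an answer 'correct' only if every reference keyword appears.

    Hypothetical illustration of the keyword-alignment rule; the actual
    assessment in this study was made by human reviewers, not by code.
    """
    text = response.lower()
    if all(kw.lower() in text for kw in reference_keywords):
        return "correct"
    # Inconsistent with the references, or the answer is incomplete
    return "incorrect"

# Example: checking an answer about the cough test against reference keywords
answer = "Before filling, perform a cough test to verify pressure transmission."
print(categorize_response(answer, ["cough test", "pressure"]))  # correct
```

Note that an answer missing even one reference keyword is counted as incomplete and therefore “incorrect,” matching the binary criterion stated above.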
Medical ethics approval
We obtained medical ethics approval to ensure strict adherence to ethical guidelines and regulations. The study received approval from the ethics committee of West China Hospital of Sichuan University (No. 20234311), and informed consent was obtained from all subjects.
Statistical analysis
Percentages were employed to describe the data, and Fisher’s exact test was used to assess statistical differences between the two groups of percentages. Data processing was conducted using SPSS 24.0 and Prism 9, with statistical significance defined as p < 0.05.
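The comparison of the two accuracy rates reported in the Results can be reproduced as a 2×2 Fisher’s exact test. A minimal sketch using SciPy (the study itself used SPSS and Prism; counts are taken from Table 3):

```python
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = question type, columns = correct / incorrect
# Conceptual: 5 correct, 5 incorrect; non-conceptual: 5 correct, 5 incorrect
table = [[5, 5],
         [5, 5]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```

For this perfectly balanced table the two-sided p-value is 1.0, consistent with the “p > 0.9999” reported below.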
Results
ChatGPT’s performance on conceptual urodynamic quality control questions
In our inquiry of 10 basic conceptual questions on UDS quality control, following the GUP guidelines, ChatGPT demonstrated an accuracy rate of 50% (5/10) (Table 3, Figs. 1, 3).
Table 3: Comparison of ChatGPT’s accuracy rate for different types of questions.
Fig. 1: (a) When a researcher requests ChatGPT to analyze images and urodynamic traces. (b) When a researcher asks ChatGPT to analyze artifacts and interpret a report based on a urodynamic trace. (c) When a researcher requests ChatGPT to provide examples of UDS images. (d) Example of ChatGPT’s structured answer.
ChatGPT’s performance on non-conceptual urodynamic quality control questions
ChatGPT demonstrated a 50% accuracy rate (5/10) for non-conceptual questions related to urodynamic quality control (Fig. 2). There was no statistically significant difference between the accuracy rates for non-conceptual and conceptual questions, with 10 questions in each group (p > 0.9999) (Table 3, Fig. 3).
Fig. 2: (e) ChatGPT 4.0 was requested to analyze the uroflow study trace. (f) ChatGPT was asked to analyze the cystometry and pressure-flow study traces. (g) ChatGPT was asked to provide a diagnosis and treatment recommendations based on the traces. (h) ChatGPT was asked to evaluate the quality of the UDS traces.
Fig. 3: ChatGPT’s accuracy rate for different types of questions
ChatGPT’s performance in dealing with questions about interpretation of urodynamics traces
When asked about its ability to process image data, ChatGPT 3.5 clarified that it cannot directly analyze or process images, as its capabilities are primarily focused on text comprehension and generation. When ChatGPT 4.0, which supports image uploading, was given a typical urodynamic graph, the model recognized it as a medical chart, likely a urodynamic trace. However, when asked for a detailed clinical analysis of the image, ChatGPT 4.0 acknowledged its limitations, indicating that specific analysis requires medical knowledge and expertise beyond its capabilities. It emphasized that interpretation of medical charts or images should be carried out by qualified healthcare professionals with the relevant background and expertise (Fig. 2).
Discussion
Regional and national urodynamics quality investigations in China have uncovered deficiencies in fundamental quality controls, such as ‘standard zero setting,’ ‘standard cough test,’ and ‘typical value ranges.’ This study identifies the primary cause as operators deviating from guidelines owing to a lack of the necessary professional knowledge and skills training, a deficiency that is reflected in the low quality of UDS reports. Given ChatGPT’s widespread use in the medical field, numerous studies in urology have explored its impact. A study highlighting ChatGPT’s role in Clinical Decision Making (CDM) demonstrated that AI-based CDM effectively supports physicians’ clinical decisions and enhances treatment outcomes. The question of whether ChatGPT can be employed in quality control has emerged as a focal point in urodynamics research.
Quality control in UDS encompasses two main aspects: Standardized Operational Processes (SOP) and the identification and correction of artifacts during testing. When a conceptual question about the SOP of urodynamic tests was posed, ChatGPT adeptly delivered hierarchical responses aligned with logical reasoning. For example, when asked whether it knew about quality control in urodynamic studies, ChatGPT provided comprehensive insights into its impact on Accurate Diagnosis, Treatment Planning, Patient Safety, Research, Clinical Trials, and Legal and Ethical Considerations. Similarly, when prompted about the process of quality control during a urodynamic study, ChatGPT responded in a structured, bullet-point format, covering key aspects such as Calibration, Patient Preparation, Test Level Setup, Data Monitoring, Documenting Patient Information, Data Analysis, Repeating Measurements, Staff Training, Equipment Maintenance, Data Storage and Record-keeping, and Quality Assurance Protocols. However, when questioned about specific research in urodynamic quality control, ChatGPT acknowledged its limitations. This limitation is evident in responses to inquiries about a particular article, “Good urodynamic practices: uroflowmetry, filling cystometry, and pressure‐flow studies,” to which ChatGPT provided a similarly general response. In our inquiry of 10 basic conceptual questions on UDS quality control following the GUP guidelines, such as the cough test in urodynamic studies, ChatGPT demonstrated an accuracy rate of 50%. ChatGPT also achieved a 50% accuracy rate for non-conceptual questions related to urodynamic quality control, including topics like standard zero setting in urodynamic studies, consequences of non-standard zero setting, identifying artifacts, and providing advice for enhancing urodynamic quality control skills.
More challenging questions involving the identification and correction of artifacts, such as distinguishing between terminal detrusor overactivity and detrusor contraction at void, or differentiating between detrusor overactivity and low bladder compliance, resulted in explanations of terminology but inaccurate suggestions for specific differential diagnoses. It is important to emphasize that ChatGPT may not provide accurate responses because it lacks access to the pertinent literature. It may struggle with complex situations and, owing to ethical considerations, refrains from directly engaging with inquiries related to clinical treatment or diagnosis. Additionally, responses may vary over time and with different question prompts. In the context of general clinical urodynamic procedures, operators can ask ChatGPT about quality control methods in their preferred language; thus, it can effectively serve as an ‘electronic dictionary’ for urodynamicists. However, based on the current results, it is yet to be proven that the use of ChatGPT can improve the overall quality of urodynamic studies.
Interestingly, our research found no significant difference in accuracy rate between the two groups (50% vs. 50%, p > 0.9999). When answering non-conceptual questions, ChatGPT encountered issues similar to those seen with conceptual questions, especially when tasked with retrieving answers from specific literature. In our study, queries related to recent literature, such as those concerning issues with Typical Value Ranges (TVR), posed a challenge for ChatGPT. Yet TVR plays a crucial role in assessing the quality of urodynamic examinations at the initial stage, so shortcomings in this respect expose the limitations of ChatGPT’s application in clinical practice. It is important to note that, as an AI language model, ChatGPT lacks direct access to the literature and can only respond based on information acquired during its training and its inherent capabilities. While the model’s training dataset includes text from various sources, there is no guarantee that it incorporates the latest published literature. Users should be mindful that ChatGPT’s responses are based on its training data and may not always be accurate or comprehensive. To optimize responses, we advise articulating queries clearly and coherently, providing the necessary contextual information, and avoiding overly complex language or terminology.
The obtained results highlight limitations in ChatGPT 3.5 and 4.0 when processing and analyzing urodynamic traces. ChatGPT 3.5 explicitly states its inability to process image data directly, focusing on text comprehension and generation. ChatGPT 4.0, with image uploading capability, can recognize a typical urodynamic trace as a medical chart but acknowledges the necessity for specific medical knowledge beyond its capabilities when asked for detailed clinical analysis. Researchers are exploring ChatGPT’s use in radiographic diagnosis, where it has been employed in image segmentation, registration, enhancement, and computer-aided diagnosis systems. Further training of ChatGPT for image analysis could address this gap. Presently, neither ChatGPT 3.5 nor 4.0 can be used for the analysis and diagnosis of urodynamic traces. Although our study did not explicitly demonstrate a role for ChatGPT 3.5 and 4.0 in urodynamic study quality control and trace analysis, current research in other medical fields, particularly radiology and laboratory medicine, has confirmed its potential value in these areas, as mentioned before. For instance, recent studies have demonstrated that ChatGPT can assist in summarizing and interpreting radiology reports, with some success in generating structured findings and impressions that align with expert radiologists’ assessments. In laboratory medicine, ChatGPT has been tested for its ability to interpret common lab test results, including complete blood counts and metabolic panels, and has shown potential in providing preliminary explanations for abnormal values. However, while these applications are promising, studies also highlight limitations such as occasional inaccuracies, lack of clinical contextualization, and the potential for over-reliance on AI-generated interpretations.
At the same time, when discussing the application value of ChatGPT in various clinical fields, we should not overlook a very important topic: the medical ethics and legal issues related to AI. The legal responsibility associated with AI-assisted medical decision-making remains a complex and evolving issue worldwide. Currently, there are three primary perspectives on liability. First, some argue that physicians or healthcare institutions should bear responsibility if they rely excessively on AI without applying independent clinical judgment. Second, others suggest that AI developers should be held accountable if errors arise due to algorithmic biases, flawed training data, or software defects, similar to product liability cases. A third viewpoint proposes a shared responsibility model, where liability is distributed among physicians, healthcare providers, and AI developers. Existing legal frameworks vary across regions. In the United States, traditional medical malpractice laws still apply, placing final decision-making responsibility on physicians. In China, the Civil Code includes provisions for medical liability, though its specific application to AI-assisted diagnostics is still under discussion. Beyond legal considerations, ethical concerns also play a crucial role in AI implementation. Scholars suggest that healthcare institutions should follow the principle of transparency by informing patients about AI’s role in clinical decision-making and obtaining informed consent before its use. Another potential regulatory approach involves independent certification and oversight of AI systems in clinical practice to ensure reliability and minimize liability risks.
It is crucial to acknowledge the limitations of this study. Firstly, the evaluation criteria relied on a specific set of references, potentially excluding other relevant literature or expert opinions. Secondly, ethical considerations related to AI systems in urodynamic research were not discussed. Researchers may also introduce biases when evaluating ChatGPT’s responses based on guidelines. Lastly, it must be pointed out that this study only preliminarily demonstrates whether ChatGPT can be applied to answer conceptual and non-conceptual questions related to urodynamic quality control and performance, but cannot prove that the use of ChatGPT can improve the overall quality of urodynamics. These limitations underscore the necessity for further research to address these factors and enhance the reliability and effectiveness of the ChatGPT system in urodynamic studies.
Conclusions
This study preliminarily demonstrates ChatGPT’s limited performance in answering conceptual and non-conceptual questions related to urodynamic quality control, with no significant difference found between the two kinds of questions. Additionally, ChatGPT currently lacks the capability to process image data for urodynamic trace analysis. The study suggests that ChatGPT has the potential to serve only as an ‘electronic dictionary’ to aid urodynamic operators; it should be noted that this study cannot prove ChatGPT’s ability to change the overall quality of urodynamic examinations.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.