Abstract

Modern generative artificial intelligence (AI) models typified by StableDiffusion1 (generating images) and GPT32 (generating text) are among the most exciting developments in AI research. These models have captured the imagination of the general public and have resulted in an explosion of AI-generated art and text while sparking a broader conversation around the presence of AI within the public domain and specific communities (most notably the art community) all of which stand to be reshaped by these technologies. Over a century ago, Oscar Wilde wrote that “There is no such thing as a moral or an immoral book. Books are well-written, or badly written. That is all.” Although this may yet hold true for the realm of art, radically different standards exist for science and medicine where veracity is paramount. Galactica,3 Meta's recently recalled model that “wrote” scientific papers, was found after release to harbor racial biases, cite fictional sources, and authoritatively assert inaccurate facts replete with fake citations.4 This recent controversy came as no surprise to many critics in the AI field who have taken to referring to large language models such as Galactica and GPT3 as “stochastic parrots.”5 Critics of these models emphasize that the models have learned to capture the statistical properties of language without any underlying referents or ground truth. Therefore, it should come as no surprise that writing generated by these language models can superficially read well while lacking underlying substance. The latest in this line of generative language models is OpenAI's ChatGPT,6 a derivative of GPT3 that was subsequently fine-tuned to function as a chatbot. ChatGPT's incredible capabilities derive specifically from this two-part construction, taking advantage of GPT3's incredible modeling of human language while fine-tuning it to generate new text that is conversational and responsive to human feedback. In their paper in this issue of Neurosurgery, D'Amico et al7 asked the obvious question, “How can ChatGPT be incorporated into neurosurgical practice?” Appropriately, they asked this question of ChatGPT itself and reported ChatGPT's response as well as their own commentary on it. Chatbots themselves have a long history in medicine going back to the 1980s,8 but as the authors note, there is a surge in modern applications built on language model technologies such as ChatGPT. It is the very stochastic nature of these language model-driven chatbots that impresses users who are amazed by the fluency of their responses. It is this same stochastic nature that harbors some of their major weaknesses. Older, rule-based systems could be thoroughly vetted and limited to a narrow range of responses (“please wait, while I connect you with an operator for further assistance”). Language model-based chatbots are fundamentally different. These systems learn a stochastic model to predict the next word given a series of words as inputs—they do nothing more and nothing less. Such models are intrinsically unpredictable, which poses a unique challenge to neurosurgeons seeking to use them in practice. The stochastic nature of language model-based chatbots, such as ChatGPT, makes them highly sensitive to their input prompt (the input words) and how that prompt interacts with the underlying model. If users are not familiar with the peculiarities of a given model, they might not be able to get their desired result. For example, the prompt used by D'Amico et al did not specifically ask for citations, and the generated research article unsurprisingly lacked them. Had the authors ask for an article replete with citations, ChatGPT would have provided them. Alternatively, had a different model been used such as Galactica, the article would likely have included citations without additional prompting since Galactica naturally includes citations because it was fine-tuned specifically on scientific manuscripts. The stochastic nature of these software tools and their dependency on the input prompts and the training regimens (which are often unknown to users) makes for radically different software from what we are used to in neurosurgery where we expect a highly consistent and reliable result to be delivered with every interaction. All articles have limits and constraints, whether in the number of words, citations, authors, or other elements. All systems have brakes. When researchers write, they are selective in which topics they do or do not discuss, promote or ignore, or cite, providing a reference resource for subsequent readers. All these are conscious choices sometimes based on thoughtful decisions and sometimes not (just citing articles that a prior article on the same topic seemed to cite). Can a computer do this better? There is no doubt that AI can provide efficiencies in this area, but with any source of assistance, the work needs to be reviewed by the authors. We see articles now with errors or oversights that clearly indicate that the report was not read by all the listed authors. Usually the peer-review process with a fresh set of reviewer's eyes can spot these deficiencies and recommend corrections, but not always. We can imagine that the peer-review process may get more difficult or at least need to change as well, when faced with AI-generated text. Human authors put some degree of nuance into every written sentence. Some are pure declarative statements of fact that have little, and some are interpretations of data that are almost all nuance. Readers rely on this to make up their own mind on the importance and value of a manuscripts message. Will an AI, that did not do the surgery, did not manage the complications afterward, did not speak with the family for weeks, and did not see the variability within that collection of 50 aneurysms that make up the report, be well positioned to convey what the authors truly mean? Despite these limitations, the potential for AI chatbot writing holds interesting promise. Marcus9 describes a prompt by Henry Minsky asking ChatGPT to “describe losing your sock in the dryer in the style of the declaration of independence,” which resulted in a creative response, “When in the course of household events, it becomes necessary for one to dissolve the bonds that have connected a sock to its mate and to assume among the powers of the laundry room, the separate and equal station to which the laws of physics and of household maintenance entitle it, and a decent respect to the opinions of socks requires that it should declare the causes which impel it to go missing.” ChatGPT, such as Searle in his Chinese room, has never experienced the angst of losing its socks. However, for this prompt, the result is impressive and raises the possibility that AI chatbot writing, such as most tools, can be effectively wielded in the hands of an experienced practitioner. Like human authors, where cycles of edits and mutual checking lead to a final written product, this same potential exists for AI writing as well. Although AI models may lack the nuances of human authors who have grounded referents for words, the ability of AI writing to rapidly prototype can augment and extend the capabilities of neurosurgeon authors who understand how to use them for a wide range of purposes. Generative image models already have informal courses in their utilization, and it is easy to conceive of similar work for generative language models. Perhaps future educational initiatives in using, checking, and screening AI-generated content could be a valuable means of augmenting the neurosurgical workforce. Other concerns highlighted by the authors around patient privacy, ethics, bias, and the legal liability around ChatGPT and similar tools are easy to imagine but hard to control because of the very nature of the language models that drive these chatbots. One way of constraining language model outputs is by automatically appending a prefix to user prompts to alter model behavior. For example, instructions to not browse the internet and to not be offensive were automatically appended to the authors' prompt used to generate their article. However, techniques like these are easily avoided using prompt injection whereby users can tell the model to ignore their previous directions—thereby ignoring the appended prefix that was supplied to restrict model behavior. The biases of these generative models are particularly concerning, and we currently lack easy solutions for assessing the encoded biases of these models much less controlling for them. Part of the challenge lays within the training process of these models, which universally involves scraping the internet for data (an act which itself poses ethical concerns around privacy). These large internet-scale data sets of image, text, or both are then used for model training, and in the process of modeling these data sets, the generative models absorb the biases inherent within them. The results for both generative text10,11 and image12 models are concerning, and while current efforts to investigate and control for bias are encouraging,13 more work needs to be performed on both a theoretical and translational level that ultimately may have to come down to the underlying training data itself. As with all technologies, how we use tools such as ChatGPT is as important as what the technologies actually. As the author's note, “medicine as a whole must consider the use of chatbot-generated or machine-generated content and ensure that it is appropriate for its intended purpose.” Neurosurgery is frequently at the vanguard for the use of technology in medicine, and we believe that this will be no different with chatbots as with other technologies. However, the effective, safe, and ethical use of these language model-based chatbot technologies such as ChatGPT at the present require both close human supervision and more theoretical work into their control and utilization.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call