Natural language processors (NLPs) such as ChatGPT are novel sources of online healthcare information that are readily accessible and increasingly integrated into internet search tools. The accuracy of NLP-generated responses to health information questions is unknown. We queried four NLPs (ChatGPT 3.5, ChatGPT 4, Bard, and Claude 2.0) for responses to simulated patient questions about inguinal hernias and their management. Responses were graded on a 5-point Likert scale (1 = poor to 5 = excellent) for relevance, completeness, and accuracy. Responses were also compiled and scored collectively for readability, using the Flesch Reading Ease and Flesch-Kincaid Grade Level scores, and for educational quality, using the DISCERN instrument, a validated tool for evaluating patient information materials. Compiled responses were compared with two gold-standard educational materials provided by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) and the American College of Surgeons (ACS). Evaluations were performed by six hernia surgeons. The average NLP response scores for relevance, completeness, and accuracy were 4.76 (95% CI 4.70-4.80), 4.11 (95% CI 4.02-4.20), and 4.14 (95% CI 4.03-4.24), respectively. ChatGPT 4 received higher accuracy scores (mean 4.43 [95% CI 4.37-4.50]) than Bard (mean 4.06 [95% CI 3.88-4.26]) and Claude 2.0 (mean 3.85 [95% CI 3.63-4.08]). The ACS document received the best scores for reading ease (55.2) and grade level (9.2); however, none of the documents met the readability thresholds recommended by the American Medical Association. The ACS document also received the highest DISCERN score, 63.5 (95% CI 57.0-70.1), which was significantly higher than the scores for ChatGPT 4 (50.8 [95% CI 46.2-55.4]) and Claude 2.0 (48.0 [95% CI 41.6-54.4]). The evaluated NLPs provided relevant responses of reasonable accuracy to questions about inguinal hernias. Compiled NLP responses received relatively low readability and DISCERN scores, although results may improve as NLPs evolve or as question wording is refined. As surgical patients expand their use of NLPs for healthcare information, surgeons should be aware of the benefits and limitations of NLPs as patient education tools.
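For reference, the readability metrics cited above are computed from word, sentence, and syllable counts. The abstract does not state the exact formulas used, but the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulations (assumed here) are:

$$
\begin{aligned}
\text{Reading Ease} &= 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right) \\
\text{Grade Level} &= 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
\end{aligned}
$$

Under the standard interpretation of these scales, a Reading Ease score of 55.2 falls in the "fairly difficult" band and a Grade Level of 9.2 corresponds to roughly a ninth-grade reading level.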