My last several posts drew inspiration from materials donated by the Mississippi Board of Medical Licensure to the FSMB Historical Collection. I used a representative sample drawn from license applications submitted to that board to explore the demographics of Mississippi’s early physician licensees. My most recent post addressed test construction and performance data on the Mississippi exam from 1924 to 1958.
Today’s post focuses on the actual questions presented on the Mississippi exam. As my colleague Cyndi Streun, D. Eng., and I perused the two volumes containing all the Mississippi board’s test questions from 1924 to 1958, any number of questions came to mind. Were these quality test items? Did they require candidates to apply relevant medical knowledge? Or were they exercises in recall, asking for factoids and medical trivia? In essence, we wondered, “How good was the exam?”
To answer these questions, we sought formal review from a group of subject matter experts (SMEs): licensed physicians. The four physicians who participated with us all held ABMS certification in either family medicine or internal medicine, and all had experience in assessment through their service as committee members working on the United States Medical Licensing Examination® (USMLE®).
The sheer volume of questions available for review presented both an opportunity and a quandary. More than 3,900 test questions were available to us, with three hundred or more items in each of the subject areas (e.g., anatomy, physiology, surgery). Even reviewing all the test items from a single year meant asking an SME to read and critique roughly 80 or more questions. To make the task more manageable for SMEs volunteering their time, we settled upon a small subset of questions from a single subject area.
The SMEs reviewed 56 physical diagnosis questions drawn from multiple administrations of the Mississippi board’s medical licensing exam spanning more than thirty years. We pulled items from this subject area from the test administrations of 1925, 1930, 1935, 1940, 1945, 1950 and 1955. Each SME reviewed these questions independently. For every test item, they gave a Yes/No response to a series of prompts:
- Is the question clinically relevant?
- Is the question appropriate for inclusion on a medical licensure examination?
- Does the question involve clinical reasoning?
If an SME answered affirmatively for clinical reasoning, they were then asked to assign the question a specific level from Bloom’s taxonomy (i.e., remembering, understanding, applying, analyzing, evaluating, creating).
This methodology allowed us to compute a consensus score reflecting the SMEs’ collective rating of each test item. Perfect agreement, with all four raters answering yes, produced a score of 1.0 for that test item; three of four raters saying yes equaled 0.75, and so on. The same approach was used to score appropriateness and whether clinical reasoning was required. Summing the item scores by year yields the totals reported in Table 1; for example, perfect agreement among the raters on clinical relevance for all eight items from 1925 would have produced a relevance score of 8 for that year.
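For readers who want to see the arithmetic spelled out, here is a minimal sketch in Python of how such consensus scores might be computed. The item identifiers and ratings below are invented for illustration; only the scoring rule (the fraction of raters answering yes, with item scores summed by year) reflects the approach described above.

```python
# Minimal sketch of the consensus scoring described above.
# The ratings here are hypothetical; True = "Yes", False = "No".
item_ratings = {
    "1925_item_03": [True, True, True, False],  # 3 of 4 raters said yes
    "1925_item_07": [True, True, True, True],   # unanimous yes
}

def consensus_score(ratings):
    """Fraction of raters answering yes: 4/4 -> 1.0, 3/4 -> 0.75, etc."""
    return sum(ratings) / len(ratings)

# Per-item consensus scores
scores = {item: consensus_score(r) for item, r in item_ratings.items()}
print(scores)  # {'1925_item_03': 0.75, '1925_item_07': 1.0}

# Summing item scores within a year gives the "points" reported in Table 1;
# unanimous agreement on all eight 1925 items would total 8.0 for that year.
year_total = sum(scores.values())
print(round(year_total, 2))  # 1.75 for these two illustrative items
```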
There was no perfect consensus among the four raters in any of the three categories in any year of this study. In general, the SMEs’ retrospective review deemed the Physical Diagnosis questions clinically relevant (overall 0.857) and appropriate for a medical licensing exam (overall 0.785). See Table 1.
Table 1: SME evaluation of clinical relevance, appropriateness and clinical reasoning for the 56-item set drawn from Mississippi’s medical licensing exam
| Year | # items | Relevance points | Appropriateness points | Clinical reasoning points |
|------|---------|------------------|------------------------|---------------------------|
| 1925 | 8 | 5.75 | 5.5 | 3 |
| 1930 | 7 | 6.25 | 6 | 4.75 |
| 1935 | 8 | 7 | 6.75 | 4.25 |
| 1940 | 8 | 7 | 5.75 | 4 |
| 1945 | 9 | 8.5 | 7.75 | 5 |
| 1950 | 8 | 7.5 | 6.25 | 5.25 |
| 1955 | 8 | 6 | 6 | 3.5 |
| Total | 56 questions | 48 pts | 44 pts | 29.75 pts |
| Average | | 0.857 | 0.785 | 0.531 |
The SMEs judged clinical reasoning the weakest of the three domains considered. Indeed, this area proved challenging for the SMEs themselves, as it is clear that some did not view the lowest levels of Bloom’s taxonomy (remember, understand) as falling within their own definition of clinical reasoning in a medical context.
We quantified the application of Bloom’s taxonomy by identifying instances in which at least two SMEs marked the same question as requiring higher-order clinical reasoning (apply, analyze, evaluate). Only 12 of the 56 test items met this threshold.
To the extent the SMEs deemed clinical reasoning necessary to answer a question, it was generally seen as occurring at the lower levels of Bloom’s taxonomy; in fact, the SMEs assigned the lowest levels (remember, understand) to 55% of the 56 test items.
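The threshold described above is simple to express in code. The sketch below, again in Python with invented Bloom assignments, shows how one might flag items where at least two SMEs selected a higher-order level; nothing here beyond that counting rule comes from the actual study data.

```python
# Illustrative sketch: flag items where at least two SMEs assigned a
# higher-order Bloom level (apply, analyze, evaluate). Data are invented;
# None means the SME judged that no clinical reasoning was required.
HIGHER_LEVELS = {"apply", "analyze", "evaluate"}

bloom_ratings = {
    "1945_item_02": ["apply", "analyze", "remember", None],
    "1950_item_05": ["remember", "understand", None, "remember"],
}

def meets_threshold(assignments, minimum=2):
    """True if at least `minimum` SMEs chose a higher-order Bloom level."""
    return sum(level in HIGHER_LEVELS for level in assignments) >= minimum

flagged = [item for item, levels in bloom_ratings.items() if meets_threshold(levels)]
print(flagged)  # ['1945_item_02']
```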

So what might we surmise from all of this? Several things, but first let’s be clear about the limitations of what I’ve shared. This project involved a subset of test items from a single subject area (Physical Diagnosis). More ambitious work might tackle all 300+ test items from a single subject area…or all the questions posed in a single year’s test administration…or questions from multiple administrations of the Mississippi exam. And while there is nothing to suggest that Mississippi was unique in the construction and content of its licensing examination, this remains a snapshot from a single state.
Having said all this, several things are worth stressing. It seems fair to say that our SMEs held generally positive views of the appropriateness of the questions being asked and of their clinical relevance. In essence, what was being asked seemed reasonable, even though the questions did not consistently require candidates to apply their knowledge.
It is also fair to say that these test items (and the examinations in general) were very much products of their time. On multiple occasions the SMEs called out “antiquated” or “dated” language, or noted that today one would apply a “more specific diagnosis.” The SMEs also flagged instances in which a question might be far less relevant today than at the time it appeared on this exam. For example, a question about diagnosing typhus or rheumatic fever today (compared to the 1930s) might feel like “esoterica.” Additionally, imaging and laboratory tests are now the standard approach for conditions that were diagnosed through physical findings several generations ago.
Indeed, one of the major challenges that arose early in working with the SMEs was agreeing upon the lens through which their evaluation should be made. Specifically, should the SMEs rate these questions for appropriateness, clinical relevance, etc. through the lens of medicine as taught and practiced today, or through the lens of what was understood and practiced at the time of the test? Ultimately, we directed the SMEs to evaluate the test questions through the lens of medicine today. The rationale was that we could not expect our SMEs to be historians of medicine, evaluating each item against the medical knowledge available at specific points in mid-20th-century America.
A final element of this study involved using an AI model to generate answers to each of the 56 physical diagnosis questions drawn from the Mississippi exam. We then asked our four SMEs to rate the AI-generated answers against the following criteria:
- Did the answer show a clear understanding of the question?
- Was the answer clear, concise and without irrelevant information?
- Did the answer use appropriate language and terminology?
- Did the answer provide factually correct information?
- Did the answer show evidence of clinical reasoning?
Our human SME raters scored the AI-generated answers highest on providing factually correct information, using appropriate language and avoiding irrelevant information. Their views diverged markedly on whether the answers demonstrated a clear understanding of the question. Finally, and perhaps not surprisingly, the SMEs scored the AI-generated answers lowest on showing evidence of clinical reasoning.


Final reflections:
Mississippi’s approach to assessment was conventional and consistent with long-standing state board methods (extended-response test items). It was also vulnerable to the deficiencies inherent in that format, e.g., poor reliability. The board members were not measurement scientists, and perhaps not even experts in the subject area(s) for which they were asked to write test items.
The questions posed on the Mississippi exam were generally relevant and appropriate to a licensing examination. They may not have consistently required physician candidates to apply a high degree of clinical reasoning, but, based upon my familiarity with medical licensing exam questions in other states, the Mississippi exam seems no better and no worse in that regard.
Perhaps most important was the standard being applied, as evidenced by the pass rate on the Mississippi exam. It seems clear that, regardless of what was being asked, the depth and breadth of knowledge deemed acceptable as evidence of proficiency was an easily achievable hurdle for nearly all Mississippi candidates in most years. Remember, Mississippi passed every candidate it tested in 23 of the 35 years in this study. Only 1.3% of all candidates were failed by the Mississippi board.
Finally, we’ve all heard an iteration of the quote describing medicine as a blend of science and art. If so, it seems that testing during the era of state medical licensing exams probably belonged as much to the latter as it did to the former.
The opinions expressed are those of the author and do not represent the views of the Federation of State Medical Boards.