

The RESPONSE EQA Analysis Program


This is an attempt to explain the RESPONSE computer program, which is used to analyse the data in the National Renal Pathology External Quality Assessment Scheme and to provide personal performance data to all the participants.

The software package is available to other EQA schemes that wish to use it; contact the author at pnf1@le.ac.uk


Problems

The evaluation of responses in Histopathology EQA schemes has been limited by problems which do not arise in other disciplines.

  1. The data is textual, not numeric, and so is not easily amenable to statistical analysis.
  2. With most diagnoses there is potential for synonyms, and, more difficult still, partial synonyms which may convey subtle additional information.
  3. The range of possible diagnoses is vast. If a list of diagnoses were to be provided for participants to choose from, its size would rival the SNOMED code book.
  4. There is a need to permit uncertainty in responses. A differential diagnosis list is often essential, especially if participants are provided with small biopsies and limited information on further investigations.
  5. The `true' diagnosis may never be known. Indeed, a review of the history of medicine would indicate that even those diagnoses which today seem obvious might be viewed differently in years to come.
  6. Not all diagnostic errors are equally important. To misclassify one completely-excised benign tumour as another may not matter, but a benign/malignant error may be disastrous.
  7. The participants are individuals; confidentiality is therefore paramount, and notification of performance must be done with great sensitivity.

The computer program provided here attempts to provide an objective method of assessment for the responses to Histopathology EQA schemes. It does not attempt to define what is an unacceptable or dangerous level of performance; such questions are ones which must be addressed by the profession as a whole, once suitable methods of measurement have been provided.


An explanation of the Histopathology EQA computerised scoring system for participants

(Written in response to several requests and misunderstandings)

Peter Furness

There are two stages. An assessment of the popularity of the diagnoses proffered for each case can be carried out as soon as responses have all been received. This is the 'case response analysis'. In the second stage, personal 'scores' are generated for each participant (personal response analysis). This usually requires some subjective decisions to be taken, so it cannot be carried out until after the cases have been discussed at a meeting of participants.

Case response analysis

Each diagnosis from each participant is recorded by the Organiser against a list of all the diagnoses proffered. The initial analysis is little more than adding up the number of 'votes' for each diagnosis. It is complicated a little by the option that participants may offer a differential diagnosis list (see below). The numbers are then scaled so that the total equals 10; this facilitates comparison between cases, where the number of participants may differ.

For example, if there are 20 people in a scheme, and for the first case 12 diagnose a tubulovillous adenoma, 6 identify a serrated adenoma, one calls it an adenocarcinoma and one a metaplastic polyp, the result would be:

Case 1

Adenocarcinoma of colon 0.5

Tubulovillous adenoma 6

Serrated adenoma 3

Metaplastic polyp 0.5
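
For those who like to see the arithmetic spelt out, a minimal sketch of this vote-counting and scaling step is given below in Python. It is purely illustrative and is not the RESPONSE source code; the function name and data layout are my own assumptions.

    from collections import Counter

    def case_response_analysis(responses):
        # Count the 'votes' for each diagnosis and scale them so that the
        # total equals 10, making cases with different numbers of
        # participants directly comparable.
        votes = Counter(responses)
        total = sum(votes.values())
        return {diagnosis: 10 * n / total for diagnosis, n in votes.items()}

    # The worked example above: 20 participants.
    responses = (["Tubulovillous adenoma"] * 12 + ["Serrated adenoma"] * 6 +
                 ["Adenocarcinoma of colon"] + ["Metaplastic polyp"])
    print(case_response_analysis(responses))
    # {'Tubulovillous adenoma': 6.0, 'Serrated adenoma': 3.0,
    #  'Adenocarcinoma of colon': 0.5, 'Metaplastic polyp': 0.5}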

These diagnoses are not identified as 'right' or 'wrong' at this stage. They merely indicate the group preference - but obviously the diagnosis with the highest popularity is most likely to be correct.

Personal response analysis

Your diagnosis for Case 1 will have been recorded as one of those in the list above (unless you have given a differential diagnosis, which makes things more complicated - see below).

The 'score' you receive for your response to this case will be 'weighted' by the number given against the relevant diagnosis in the above list.

Note that these numbers can be changed first, if the participants agree - see below.

The default mechanism is to use the popularity of each diagnosis, as given above. Accepting these figures will often make sense. The most popular diagnosis is presumably a reasonable conclusion to draw from the material circulated. A less popular diagnosis (e.g. serrated adenoma), although not completely correct, probably deserves some credit as a reasonable conclusion. A very unpopular diagnosis (e.g. carcinoma) is likely to be completely wrong. Accepting this default has the advantage that no-one is asked to make any subjective decisions about the value of each diagnosis.

So, if your diagnosis was the one with the largest number on the list (tubulovillous adenoma), you will get the highest possible score (1.0). If you made one of the other diagnoses, your score for that case will be proportionately lower, depending on the size of the number associated with the diagnosis you offered. Serrated adenoma would score (3 / 6) = 0.5; adenocarcinoma would score (0.5 / 6) = 0.083 - that is, almost zero.
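
As a rough illustration (again, not the actual RESPONSE code), the single-diagnosis score is simply the weight of your diagnosis divided by the largest weight for the case:

    def personal_score(case_weights, diagnosis):
        # A single diagnosis scores its weight divided by the largest
        # weight for the case, so the most popular diagnosis scores 1.0.
        return case_weights.get(diagnosis, 0) / max(case_weights.values())

    weights = {"Tubulovillous adenoma": 6.0, "Serrated adenoma": 3.0,
               "Adenocarcinoma of colon": 0.5, "Metaplastic polyp": 0.5}
    print(personal_score(weights, "Serrated adenoma"))         # 0.5
    print(personal_score(weights, "Adenocarcinoma of colon"))  # 0.083...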

Differential diagnosis lists

This gets more complicated. The rest of this section can be ignored if you never offer one.

If you offered a list of possible diagnoses, each with your estimate of the probability of it being correct, the procedure is more complicated. First the 'value' of each diagnosis in your list is calculated independently. Each diagnosis is then 'weighted' by the probability you gave it, and finally the figures for all the diagnoses in your differential list are added together to produce the score for your response to this case. So if you said tubulovillous adenoma 5/10, metaplastic polyp 5/10, your score for that case would be (1 x 0.5) + (0.083 x 0.5) = 0.542. In practice this usually produces a lower score, thus penalising uncertainty. The exception is if the probability you attached to the correct diagnosis is greater than the average achieved by the group as a whole, such as TVA 8/10, metaplastic polyp 2/10. Slight uncertainty in 'harder' cases, where many pathologists are uncertain, seems reasonable, so in this case the response would be recorded as fully correct.
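
A sketch of the differential-diagnosis calculation is given below. One point is assumed here rather than stated above: the group 'average' for a diagnosis is taken to be its weight expressed out of 10 (so TVA's weight of 6 corresponds to an average probability of 0.6), and the exception rule is implemented on that assumption.

    def differential_score(case_weights, differential):
        # 'differential' maps each offered diagnosis to the probability the
        # participant attached to it (probabilities summing to 1).
        max_weight = max(case_weights.values())
        # Value of each offered diagnosis, weighted by the stated probability.
        score = sum(case_weights.get(dx, 0) / max_weight * p
                    for dx, p in differential.items())
        # Assumed interpretation of the exception: if the probability given
        # to the group's preferred diagnosis exceeds the group's own average
        # (its weight out of 10), the response is recorded as fully correct.
        preferred = max(case_weights, key=case_weights.get)
        if differential.get(preferred, 0) > case_weights[preferred] / 10:
            score = 1.0
        return min(score, 1.0)

    weights = {"Tubulovillous adenoma": 6.0, "Serrated adenoma": 3.0,
               "Adenocarcinoma of colon": 0.5, "Metaplastic polyp": 0.5}
    print(differential_score(weights, {"Tubulovillous adenoma": 0.5,
                                       "Metaplastic polyp": 0.5}))  # ~0.542
    print(differential_score(weights, {"Tubulovillous adenoma": 0.8,
                                       "Metaplastic polyp": 0.2}))  # 1.0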

However, sometimes this default approach is manifestly unfair. For example, in the case given above, it may be that the lesion was genuinely a serrated adenoma but at the time of the circulation 60% of the participants had not yet heard of this lesion. The Participants' Meeting or a delegated rotating 'marking committee' must then decide what to do. This cannot be done by one individual, or they would have an unfair advantage as a participant.

In this case, the options might be:

1) Do not use the case for personal scoring as the majority got it 'wrong'.

2) Merge the second and third diagnoses to give:

Adenocarcinoma of colon 0.5

Tubulovillous/serrated adenoma 9

Metaplastic polyp 0.5

However, this is arguably unfair on those who identified the serrated adenoma and therefore did 'better' than those who called it a TVA. Furthermore, is 'metaplastic polyp' as serious an error as 'adenocarcinoma'? So, alternatively:

3) Generate completely new 'values' for evaluating each diagnosis, e.g.:

Adenocarcinoma of colon 0

Tubulovillous adenoma 7

Serrated adenoma 10

Metaplastic polyp 3

The problem here is in deciding on the values, which are obviously subjective, in a democratically acceptable way. Clearly this cannot be done by the Organiser, or that person could not also participate in the scheme and might impose views which were unacceptable to some participants. Deciding this sort of thing at a Participants' Meeting is democratic but can take a long time!

Finally, there is no reason why the computer system cannot be used as an aid to simple manual 'scoring' - apart from the time and effort it takes. For example, a scheme could have a rotating assessor or 'marking committee' to judge each participant's response to each case as 'fully correct', 'partly correct' or 'wrong'. To collate and process the data, these could then be entered as:

Case 1

Fully correct 10

Partly correct 5

Wrong 0

(As many degrees of 'correctness' as desired can be used, up to 10).
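
For illustration only: with those figures the three categories simply become case weights, and the usual arithmetic gives scores of 1.0, 0.5 and 0.0.

    category_weights = {"Fully correct": 10, "Partly correct": 5, "Wrong": 0}
    max_weight = max(category_weights.values())
    for category, weight in category_weights.items():
        print(category, weight / max_weight)  # 1.0, 0.5 and 0.0 respectively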

The consequence for the individual participant would be exactly as with manual collation, but with more sophisticated analysis and probably with fewer errors, because of the program's data verification systems.

Of course, the major problem remains - how to assign, fairly and reproducibly, a page of a pathologist's scribbled notes to one of the three categories!

I hope this clarifies the process.

Peter Furness

Updated October 1999




Peter Furness, Department of Pathology, University of Leicester.

pnf1@le.ac.uk