On review quality definitions and the ranking
The RQC grading system is based on the following requirements:
- There should be a common grading framework that is the same
throughout all venues (conferences and journals)
and that does not change over time.
- Yet every venue should be able to define review quality in a manner
that is appropriate for the specifics of its topic area
and editorial style and requirements.
- RQC is initially based on grading (using absolute terms for review quality)
but its output is in terms of a ranking (using relative reviewing quality),
so there can be no grade inflation where nearly everybody is just super.
- Review quality definitions must be sufficiently constant to allow
comparing and ranking the reviewers within a large batch of reviews.
- Yet it should also be possible to modify them over time.
These requirements are fulfilled in the following manner:
- There is a common framework of the four aspects A, D, G, T.
- Every venue has its own (and possibly unique)
review quality definition (RQdef; see below).
- The RQdef can be changed for each new instance of a conference
or reviewing year of a journal.
How to interpret the review quality definition (RQdef)
Look at the example RQdef for reference.
There are four aspects:
A: Helpfulness for Authors
D: Helpfulness for Decision
G: Graded Co-Reviews
T: Timeliness
The score for a review is computed as follows:
- One level is chosen by human graders for each facet of A and D.
That level's points are used.
- These points are multiplied by the respective localweight and summed.
The sum, divided by the sum of those localweights, is the aspect's value.
- The first applicable level from the T aspect is used
to give its value.
- The value of the G aspect is simply the percentage of co-reviews
this reviewer has graded (or 100 if there were none to be graded).
That is why there are no facets or levels for G.
- The aspect values are multiplied by the respective weight and summed.
The sum, divided by the sum of the weights, is the total for that review.
The per-review score percentile (100 is best) is for information only;
the totals are used for computing the reviewer score and ranking.
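The weighted-average computation above can be sketched as follows. All point values, localweights, and weights in this example are invented for illustration; a real RQdef supplies its own facets, levels, and weights:

```python
# Sketch of the per-review score computation described above.
# All concrete numbers below are hypothetical, not from an actual RQdef.

def aspect_value(facets):
    """facets: list of (points, localweight) pairs, one per facet.
    Returns the localweight-weighted mean of the chosen levels' points."""
    return (sum(points * lw for points, lw in facets)
            / sum(lw for _, lw in facets))

def review_total(aspects):
    """aspects: list of (value, weight) pairs, one per aspect.
    Returns the weight-weighted mean of the aspect values."""
    return (sum(value * w for value, w in aspects)
            / sum(w for _, w in aspects))

# Example with invented numbers:
a = aspect_value([(80, 2), (60, 1)])  # A: two facets, graded by humans
d = aspect_value([(90, 1)])           # D: one facet
g = 100.0                             # G: percentage of graded co-reviews
t = 70.0                              # T: points of first applicable level
total = review_total([(a, 3), (d, 3), (g, 1), (t, 2)])
```

Note that only A and D are built up from graded facets; G contributes its percentage directly and T contributes the points of its first applicable level.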
Assumed reviewing model
RQC makes the following assumptions about the way reviewing
should be interpreted:
- All reviewers are treated equally.
There is no difference between PC chairs, PC members,
primary reviewers, secondary reviewers, program board members,
superreviewers, external reviewers, subreviewers, and what-have-you.
- All reviews are treated equally.
- Reviews that are assigned but never delivered are treated
like delivered reviews.
They will be considered infinitely late and their empty
review text will be graded just like other review texts.
- PC members with zero assigned reviews are not reviewers.
They will not be included in the RQC process.
- Reviewing happens in one single phase with one single deadline.
If there are indeed multiple rounds of reviewing, the organizer of
the respective RQC process can decide at what moment the
set of all then-existing reviews will be retrieved to be graded.
- Reviewers are anonymous.
RQC will not show the name of a reviewer, but will provide a link
to the respective review page in the manuscript handling system,
so if names are visible there they are available indirectly.
- Authors of submissions are anonymous.
RQC will show the title of a submission, but not the author names.
For simplicity and transparency,
the same model is applied to all venues.
How the ranking is computed
- Each reviewer gets a score total that, by and large, is the
average of that reviewer's per-review total scores.
- Reviewers are ranked into percentiles (100 is best, 0 is worst)
according to these score totals.
The percentiles are computed such that the average across the given
batch of reviewers is 50 even in the presence of ties.
- For each reviewer, a reviewer score is computed as follows.
Assume the reviewer has supplied N reviews.
- If "Use only the best-k reviews" is a fixed number K,
the weakest-scored reviews beyond that reviewer's top K will be ignored.
- If "Use only the best-k reviews" is a percentile P,
K is the highest number of reviews that the P percent
most productive reviewers have all reached or exceeded.
(For instance, if P is 25, at most one fourth of the reviewers will get
one or more of their weaker reviews ignored.)
- The remaining per-review totals are averaged to produce the reviewer score,
using the geometric mean if "Emphasize consistent quality" is yes
or the arithmetic mean otherwise.
- This average is adjusted by a factor according to the number of reviews N:
- If "Weigh-in number of reviews" is no, the factor is 1.
- If it is somewhat, the factor is the eighth root of N.
- If it is yes, the factor is the fourth root of N.
- The result is the overall reviewer score.
- These scores are ranked and turned into percentiles such that the
best possible value is 100 and the worst 0, while the mean is guaranteed
to be 50.
- For instance, if all reviewers have the same score, all of them have
percentile 50.
- If half have score 70 and half score 80, their percentiles will be
33 and 67, respectively.
- Conference PC members without reviews are not included in the ranking.
Reviewers who have reviews but have opted out are included
with a score of 0. (Corresponding receipts never show up.)
- Only if all scores are different will the
range of percentiles be used to the max.
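The reviewer-score part of the rules above can be sketched as follows. This covers only the fixed-K variant of "Use only the best-k reviews"; the percentile-P variant and the final conversion to percentiles are omitted, and the parameter names are paraphrases of the settings, not an actual API:

```python
import math

def reviewer_score(totals, best_k=None, emphasize_consistency=False,
                   weigh_in="no"):
    """Sketch of the reviewer-score rules described above.

    totals: the reviewer's N per-review total scores.
    best_k: fixed K from "Use only the best-k reviews" (None = keep all).
    emphasize_consistency: "Emphasize consistent quality" (geometric mean).
    weigh_in: "Weigh-in number of reviews": "no", "somewhat", or "yes".
    """
    n = len(totals)
    # Ignore the weakest-scored reviews beyond that reviewer's top K.
    kept = sorted(totals, reverse=True)[:best_k] if best_k else list(totals)
    if emphasize_consistency:
        mean = math.prod(kept) ** (1 / len(kept))   # geometric mean
    else:
        mean = sum(kept) / len(kept)                # arithmetic mean
    # Adjust by a factor depending on the number of reviews N.
    factor = {"no": 1.0, "somewhat": n ** (1 / 8), "yes": n ** (1 / 4)}[weigh_in]
    return mean * factor

# Example: best 2 of 3 reviews, arithmetic mean, no N-factor.
score = reviewer_score([80, 60, 90], best_k=2)   # mean of [90, 80] -> 85.0
```

The geometric mean penalizes uneven quality: a reviewer with totals 90 and 40 scores lower under it than one with two totals of 65, although both have the same arithmetic mean.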