Monday 20 February 2012

Reasoning for Rmeas or Rpim as Cutoff

From: Jacob Keller
Date: 27 January 2012 17:47


Dear Crystallographers,

I cannot think why any of the various flavors of Rmerge/meas/pim
should be used as a data cutoff and not simply I/sigma--can somebody
make a good argument or point me to a good reference? My thinking is
that signal:noise of >2 is definitely still signal, no matter what the
R values are. Am I wrong? I was thinking also possibly the R value
cutoff was a historical accident/expedient from when one tried to
limit the amount of data in the face of limited computational
power--true? So perhaps now, when the computers are so much more
powerful, we have the luxury of including more weak data?

JPK


--
*******************************************
Jacob Pearson Keller

----------
From: Jacob Keller


Clarification: I did not mean I/sigma of 2 per se, I just meant
I/sigma is more directly a measure of signal than R values.

JPK

----------
From: Graeme Winter


Hi Jacob,

I think a lot of people do use I/sigma :o)

However sigma is in some cases poorly defined, and the merging
residuals you refer to are calculated only from the I values and are
related to I/sigma anyhow... Of course the R values are sometimes also
poorly defined for low multiplicity data.

Best wishes,

Graeme

----------
From: John R Helliwell


Dear Jacob,
As an editor I am always mindful that an article is finally under the
authors' names. That said, the reader always deserves to know at what
diffraction resolution average intensities (cease to) exist. The usual
statistical practice for doing so is to use the quantity's sigma (i.e.,
in this case, the sigma of a reflection intensity).

Good effort is made in data-processing programs to protect the quality
of the estimate of each reflection intensity's sigma, notably via the
chi-square test.

Thus I request that the diffraction resolution where <I/sig(I)>
crosses 2.0 is quoted in the article, if it is not there already. I
agree that 2.0 is arbitrary but it is more 'lenient' than the usual
'3sigma' statistical test.
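
As a rough illustration, the crossing point can be interpolated from the binned statistics that any processing log provides. A minimal numpy sketch, with invented shell values (interpolating in 1/d^2 rather than in d would be slightly more orthodox):

import numpy as np

# Hypothetical per-shell statistics as read from a processing log: high-resolution
# limit of each shell (A) and the shell's <I/sigma(I)>. The values are made up.
d_min = np.array([4.0, 3.5, 3.1, 2.8, 2.6, 2.4, 2.3])
i_over_sig = np.array([18.2, 11.5, 7.0, 4.1, 2.6, 1.7, 1.1])

def crossing_resolution(d_min, i_over_sig, threshold=2.0):
    # Linearly interpolate the resolution at which <I/sigma(I)> falls through the threshold.
    for k in range(1, len(d_min)):
        upper, lower = i_over_sig[k - 1], i_over_sig[k]
        if upper >= threshold > lower:
            frac = (upper - threshold) / (upper - lower)
            return d_min[k - 1] + frac * (d_min[k] - d_min[k - 1])
    return None  # never crosses within the binned range

print("approximate resolution where <I/sigma(I)> = 2:", crossing_resolution(d_min, i_over_sig))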

Sometimes the title or abstract has to be changed to follow this
criterion; eg 'The structure of xxx is determined to 2.4 Angstrom
resolution' type of title has to be consistent with the above
criterion.

I do not follow an 'Rmerge must be less than x% rule'.

I think the above follows reasonable general statistical practice,
whilst permitting authors reasonable freedom, and also protects the
(more innocent) readers of articles.

I am aware that the 'correlation coefficient' between randomly
partitioned parts of data sets is being increasingly discussed, this
parameter also having general statistical validity. I am monitoring
discussion on this carefully. It has long been a good way of assessing
the statistical quality of anomalous differences for example; to my
knowledge introduced by Michael Rossmann many years ago.

Best wishes,
John
--
Professor John R Helliwell DSc

----------
From: Horacio Botti

Dear all

Perhaps a bit off-topic, just an example about resolution cut-offs

mean I/sigma(I) = 2 for dmin = 3.35 A

(please have a look at the attached pdf)

I would trust I/sigma(I) = 2 (in this case it worked), but why not determine where the real information ends after the model has been refined to some extent against data with a lower I/sigma(I) cutoff, and then trim the resolution back by 0.05-0.1 A?

Delta R and the average G-factors (from PROCHECK) behaved well. Note that Rfree improved when data to higher resolution were included.

Perhaps if Rmeas or Rpim were bad at that resolution (3.35 A) the story would be different.

Best regards,

Horacio

----------
From: Ian Tickle


Jacob, here's my (personal) take on this:

The data quality metrics that everyone uses clearly fall into 2
classes: 'consistency' metrics, i.e. Rmerge/meas/pim and CC(1/2) which
measure how well redundant observations agree, and signal/noise ratio
metrics, i.e. mean(I/sigma) and completeness, which relate to the
information content of the data.

IMO the basic problem with all the consistency metrics is that they
are not measuring the quantity that is relevant to refinement and
electron density maps, namely the information content of the data, at
least not in a direct and meaningful way.  This is because there are 2
contributors to any consistency metric: the systematic errors (e.g.
differences in illuminated volume and absorption) and the random
errors (from counting statistics, detector noise etc.).  If the data
are collected with sufficient redundancy the systematic errors should
hopefully largely cancel, and therefore only the random errors will
determine the information content.  Therefore the systematic error
component of the consistency measure (which I suspect is the biggest
component, at least for the strong reflections) is not relevant to
measuring the information content.  If the consistency measure only
took into account the random error component (which it can't), then it
would essentially be a measure of information content, if only
indirectly (but then why not simply use a direct measure such as the
signal/noise ratio?).

There are clearly at least 2 distinct problems with Rmerge: first, it
includes systematic errors in its measure of consistency; second, it is
not invariant with respect to the redundancy (and third, it's useless
as a statistic anyway because you can't do any significance tests on
it!).  The redundancy problem is fixed to some extent by Rpim etc.,
but that still leaves the other problems.  It's not clear to me that
CC(1/2) is any better in this respect, since (as far as I understand
how it's implemented), one cannot be sure that the systematic errors
will cancel for each half-dataset Imean, so it's still likely to
contain a large contribution from the irrelevant systematic error
component and so mislead in respect of the real data quality exactly
in the same way that Rmerge/meas/pim do.  One may as well use the
Rmerge between the half dataset Imeans, since there would be no
redundancy effect (i.e. the redundancy would be 2 for all included
reflections).
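
For readers who want to see the redundancy dependence concretely: the three merging residuals differ only in a multiplicity-dependent prefactor on the numerator, and CC(1/2) is just a Pearson correlation between half-dataset means. A toy numpy sketch on synthetic unmerged data (all numbers are invented, and the half-dataset split is simplified):

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Synthetic unmerged observations: for each unique reflection, a few measurements
# of a true intensity plus Gaussian noise. Purely illustrative.
obs = defaultdict(list)
for h in range(500):
    true_I = rng.exponential(100.0)
    for _ in range(int(rng.integers(2, 8))):       # multiplicity of this reflection
        obs[h].append(true_I + rng.normal(0.0, 10.0))

def merging_r(obs, weight):
    num = den = 0.0
    for ints in obs.values():
        ints = np.asarray(ints)
        n = len(ints)
        if n < 2:
            continue
        num += weight(n) * np.abs(ints - ints.mean()).sum()
        den += ints.sum()
    return num / den

r_merge = merging_r(obs, lambda n: 1.0)                        # no multiplicity correction
r_meas  = merging_r(obs, lambda n: np.sqrt(n / (n - 1.0)))     # multiplicity-corrected
r_pim   = merging_r(obs, lambda n: np.sqrt(1.0 / (n - 1.0)))   # precision-indicating

# CC(1/2), simplified: split each reflection's observations into two random halves
# and correlate the two sets of half-dataset means.
half1, half2 = [], []
for ints in obs.values():
    ints = np.array(ints, dtype=float)
    if len(ints) < 2:
        continue
    rng.shuffle(ints)
    m = len(ints) // 2
    half1.append(ints[:m].mean())
    half2.append(ints[m:].mean())
cc_half = np.corrcoef(half1, half2)[0, 1]

print(r_merge, r_meas, r_pim, cc_half)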

I did some significance tests on CC(1/2) and I got silly results, for
example it says that the significance level for the CC is ~ 0.1, but
this corresponded to a huge Rmerge (200%) and a tiny mean(I/sigma)
(0.4).  It seems that (without any basis in statistics whatsoever) the
rule-of-thumb CC > 0.5 is what is generally used, but I would be
worried that the statistics are so far divorced from the reality - it
suggests that something is seriously wrong with the assumptions!
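
For what it's worth, the "~0.1" figure is consistent with the textbook significance test for a Pearson correlation, t = r*sqrt((n-2)/(1-r^2)) with n-2 degrees of freedom: a shell of ~1000 reflection pairs makes even a tiny CC formally significant. A sketch (the shell size and significance level below are made-up examples):

import numpy as np
from scipy import stats

def critical_cc(n, alpha=0.001):
    # Smallest correlation that is significantly greater than zero at level alpha
    # for n reflection pairs, from the t-test t = r*sqrt((n-2)/(1-r^2)).
    t_crit = stats.t.ppf(1.0 - alpha, df=n - 2)
    return t_crit / np.sqrt(n - 2 + t_crit ** 2)

print(critical_cc(1000))   # ~0.098, i.e. the CC > 0.5 rule of thumb sits far above formal significance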

Having said all that, the mean(I/sigma) metric, which on the face of
it is much more closely related to the information content and
therefore should be a more relevant metric than Rmerge/meas/pim &
CC(1/2), is not without its own problems (which probably explains the
continuing popularity of the other metrics!).  First and most obvious,
it's a hostage to the estimate of sigma(I) used.  I've never been
happy with inflating the counting sigmas to include effects of
systematic error based on the consistency of redundant measurements,
since as I indicated above if the data are collected redundantly in
such a way that the systematic errors largely cancel, it implies that
the systematic errors should not be included in the estimate of sigma.
 The fact that then the sigma(I)'s would generally be smaller (at
least for the large I's), so the sample variances would be much larger
than the counting variances, is irrelevant, because the former
includes the systematic errors.  Also the I/sigma cut-off used would
probably not need to be changed since it affects only the weakest
reflections which are largely unaffected by the systematic error
correction.

The second problem with mean(I/sigma) is also obvious: i.e. it's a
mean, and as such it's rather insensitive to the actual distribution
of I/sigma(I).  For example if a shell contained a few highly
significant intensities these could be overwhelmed by a large number
of weak data and give an insignificant mean(I/sigma).  It seems to me
that one should be considering the significance of individual
reflections, not the shell averages.  Also the average will depend on
the width of the resolution bin, so one will get the strange effect
that the apparent resolution will depend on how one bins the data!
The assumption being made in taking the bin average is that I/sigma(I)
falls off smoothly with d* but that's unlikely to be the reality.

It seems to me that a chi-square statistic which takes into account
the actual distribution of I/sigma(I) would be a better bet than the
bin average, though it's not entirely clear how one would formulate
such a metric.  One would have to consider subsets of the data as a
whole sorted by increasing d* (i.e. not in resolution bins to avoid
the 'bin averaging effect' described above), and apply the resolution
cut-off where the chi-square statistic has maximum probability.  This
would automatically take care of incompleteness effects since all
unmeasured reflections would be included with I/sigma = 0 just for the
purposes of working out the cut-off point.  I've skipped the details
of implementation and I've no idea how it would work in practice!
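
One possible (and entirely hypothetical) reading of such a scheme: if the reflections beyond a candidate cutoff contain no signal, their I/sigma values should behave like standard normal deviates, so the sum of their squares should follow a chi-square distribution with as many degrees of freedom as reflections. A sketch along those lines, with unmeasured reflections assumed to have been entered with I/sigma = 0 beforehand:

import numpy as np
from scipy import stats

def chi2_cutoff(d_star, i_over_sig, alpha=0.01):
    # Sort by increasing d* and, for each candidate cutoff, test whether everything
    # at or beyond it is consistent with pure noise (sum of squared I/sigma ~ chi^2_m).
    order = np.argsort(d_star)
    d_star = np.asarray(d_star, dtype=float)[order]
    z = np.asarray(i_over_sig, dtype=float)[order]
    best = None
    for k in range(len(z)):
        tail = z[k:]
        p_noise = stats.chi2.sf((tail ** 2).sum(), df=len(tail))
        if p_noise < alpha:          # the tail still contains detectable signal
            best = d_star[k]
        else:
            break
    return None if best is None else 1.0 / best   # resolution (A) of the last significant point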

An obvious question is: do we really need to worry about the exact
cut-off anyway, won't our sophisticated maximum likelihood refinement
programs handle the weak data correctly?  Note that in theory weak
intensities should be handled correctly; however, the problem may
instead lie with incorrectly estimated sigmas: these are obviously
much more of an issue for any software which depends critically on
accurate estimates of uncertainty!  I did some tests where I refined
data for a known protein-ligand complex using the original apo model,
and looked at the difference density for the ligand, using data cut at
2.5, 2 and 1.5 Ang where the standard metrics strongly suggested there
was only data to 2.5 Ang.

I have to say that the differences were tiny, well below what I would
deem significant (i.e. not only the map resolutions but all the map
details were essentially the same), and certainly I would question
whether it was worth all the soul-searching on this topic over the
years!  So it seems that the refinement programs do indeed handle weak
data correctly, but I guess this should hardly come as a surprise (but
well done to the software developers anyway!).  This was actually
using Buster: Refmac seems to have more of a problem with scaling &
TLS if you include a load of high resolution junk data.  However,
before anyone acts on this information I would _very_ strongly advise
them to repeat the experiment and verify the results for themselves!
The bottom line may be that the actual cut-off used only matters for
the purpose of quoting the true resolution of the map, but it doesn't
significantly affect the appearance of the map itself.

Finally an effect which confounds all the quality metrics is data
anisotropy: ideally the cut-off surface of significance in reciprocal
space should perhaps be an ellipsoid, not a sphere.  I know there are
several programs for anisotropic scaling, but I'm not aware of any
that apply anisotropic resolution cutoffs (or even whether this would
be advisable).

Cheers

-- Ian

----------
From: Randy Read


Just one thing to add to that very detailed response from Ian.

We've tended to use a slightly different approach to determining a sensible resolution cutoff, where we judge whether there's useful information in the highest resolution data by whether it agrees with calculated structure factors computed from a model that hasn't been refined against those data.  We first did this with the complex of the Shiga-like toxin B-subunit pentamer with the Gb3 trisaccharide (Ling et al, 1998).  From memory, the point where the average I/sig(I) drops below 2 was around 3.3A.  However, we had a good molecular replacement model to solve this structure and, after just carrying out rigid-body refinement, we computed a SigmaA plot using data to the edge of the detector (somewhere around 2.7A, again from memory).  The SigmaA plot dropped off smoothly to 2.8A resolution, with values well above zero (indicating significantly better than random agreement), then dropped suddenly.  So we chose 2.8A as the cutoff.  Because there were four pentamers in the asymmetric unit, we could then use 20-fold NCS averaging, which gave a fantastic map.  In this case, the averaging certainly helped to pull out something very useful from a very weak signal, because the maps weren't nearly as clear at lower resolution.

Since then, a number of other people have applied similar tests.  Notably, Axel Brunger has done some careful analysis to show that it can indeed be useful to take data beyond the conventional limits.

When you don't have a great MR model, you can do something similar by limiting the resolution for the initial refinement and rebuilding, then assessing whether there's useful information at higher resolution by using the improved model (which hasn't seen the higher resolution data) to compute Fcalcs.  By the way, it's not necessary to use a SigmaA plot -- the correlation between Fo and Fc probably works just as well.  Note that, when the model has been refined against the lower resolution data, you'll expect a drop in correlation at the resolution cutoff you used for refinement, unless you only use the cross-validation data for the resolution range used in refinement.
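
In case anyone wants to try this without a SigmaA program: the binned correlation is straightforward once Fobs, Fcalc (from a model that has not seen the high-resolution data) and d-spacings are exported to arrays. The helper below is a generic sketch, not any particular program's routine:

import numpy as np

def binned_fo_fc_cc(f_obs, f_calc, d_spacing, n_bins=20):
    # Pearson correlation of |Fobs| vs |Fcalc| in resolution bins of equal reflection
    # count (binned on 1/d^2). Bins where the CC stays well above zero still carry signal.
    f_obs, f_calc, d_spacing = map(np.asarray, (f_obs, f_calc, d_spacing))
    order = np.argsort(1.0 / d_spacing ** 2)
    results = []
    for idx in np.array_split(order, n_bins):
        cc = np.corrcoef(f_obs[idx], f_calc[idx])[0, 1]
        results.append((d_spacing[idx].min(), cc))   # (high-resolution edge of bin, CC)
    return results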

-----
Randy J. Read


----------
From: arka chakraborty


Hi all,

In the context of the ongoing discussion above, can anybody post links to a few relevant articles?

Thanks in advance,

ARKO
--

ARKA CHAKRABORTY


----------
From: Randy Read

Hi,

Here are a couple of links on the idea of judging resolution by a type of cross-validation with data not used in refinement:

  (cites earlier relevant papers from Brunger's group)

Best wishes,

Randy Read
------

----------
From: Frank von Delft


Hi Randy - thank you for a very interesting reminder of some old literature. 

I'm intrigued:  how come this apparently excellent idea has not become standard best practice in the 14 years since it was published? 

phx

----------
From: Jacob Keller


Somebody sent this to me after a previous post a while back--a sort of
case-study:

Wang, J. (2010). Inclusion of weak high-resolution X-ray data for
improvement of a group II intron structure. Acta crystallographica
Section D, Biological crystallography 66, 988-1000.

JPK

----------
From: Jacob Keller


It would seem it's because too few people know about it, and it is not
implemented in any software in the usual pipeline. Maybe it could be?

Perhaps the way to do it would be always to integrate to
ridiculously high resolution, give that to Refmac, and, starting from
lower resolution, iterate to higher resolution according to the most
recent sigmaA calculation, cutting off at some reasonable
sigmaA value?

JPK

----------
From: Florian Schmitzberger


Phenix.model_vs_data calculates a sigmaA vs resolution plot (in comprehensive validation in the GUI). Pavel would probably have replied by now, but I don't think the discussion has been cross-posted to the phenix bb.

Cheers,

Florian

----------
From: Dunten, Pete W.


Frank,

Don't you already get a plot of SigmaA versus resolution from refmac,
where the free set of reflections has been used to estimate SigmaA?

Have a look at some of your log files.

Pete

----------
From: James Holton


Once upon a time, it was customary to apply a 3-sigma cutoff to each and every spot observation, and I believe this was the era when the "~35% Rmerge in the outermost bin" rule was conceived, alongside the "80% completeness" rule.  Together, these actually do make a "reasonable" two-pronged criterion for the resolution limit.

Now, by "reasonable" I don't mean "true", just that there is "reasoning" behind it.  If you are applying a 3-sigma cutoff to spots, then the expected error per spot is not more than ~33%, so if Rmerge is much bigger than that, then there is something "funny" going on.  Perhaps a violation of the chosen space group symmetry (which may only show up at high resolution), radiation damage, non-isomorphism, bad absorption corrections, crystal slippage or a myriad of other "scaling problems" could do this.  Rmerge became a popular statistic because it proved a good way of detecting problems like these in data processing.  Fundamentally, if you have done the scaling properly, then Rmerge/Rmeas should not be worse than the expected error of a single spot measurement.  This is either the error expected from counting statistics (33% if you are using a 3-sigma cutoff), or the calibration error of the instrument (~5% on a bad day, ~2% on a good one), whichever is bigger.

 As for completeness, 80% overall is about the bare minimum of what you can get away with before the map starts to change noticeably.  See my movie here:
http://bl831.als.lbl.gov/~jamesh/movies/index.html#completeness
so I imagine this "80% rule" just got extended to the outermost bin.  After all, claiming a given resolution when you've only got 50% of the spots at that resolution seems unwarranted, but requiring 100% completeness seems a little too strict.

Where did these rules come from?  As I recall, I first read about them in the manual for the "PROCESS" program that came with our R-axis IIc x-ray system when I was in graduate school (ca 1996).  This program was conveniently integrated into the data collection software on the detector control computers: one was running VMS, and the "new" one was an SGI.  I imagine a few readers of this BB may have never heard of "PROCESS", but it is listed as the "intensity integration software" for at least a thousand PDB entries.  Is there a reference for "PROCESS"?  Yes.  In the literature it is almost always cited with: (Molecular Structure Corporation, The Woodlands, TX).  Do I still have a copy of the manual?  Uhh.  No.  In fact, the building that once contained it has since been torn down.  Good thing I kept my images!

Is this "35% Rmerge with a 3-sigma cutoff" method of determining the resolution limit statistically valid?  Yes!  There are actually very sound statistical reasons for it.  Is the resolution cutoff obtained the best one for maximum-likelihood refinement?  Definitely not!  Modern refinement programs do benefit from weak data, and tossing it all out messes up a number of things.  Does including weak data make Rmerge/Rmeas/Rpim and R/Rfree go up?  Yes.  Does this make them more "honest"?  No.  It actually makes them less useful.

Remember, all R factors are measures of _relative_ error, so it is important to ask the question: "Relative to what?".  For Rmerge, the "what" is the sum of all the spot intensities (Blundell and Johnson, 1976).  Where you run into problems is when you restrict the Rmerge calculation to a single resolution bin.  If the sum of all intensities in the bin is actually zero, then Rmerge is undefined (dividing by zero).  If the signal-to-noise ratio is ~1, then the Rmerge equation doesn't "blow up" mathematically, but it does give essentially random results.  This is because Rmerge values for data this weak take on a Cauchy-like distribution, and no matter how much averaging you do, the mean of Cauchy-distributed values never settles down.  You can see in the classic Weiss & Hilgenfeld (1997) paper that they had to use "outlier rejection" with their fake data to get Rmerge to behave even with a signal-to-noise ratio of 2.  The "turn-over point" where the Rmerge equation becomes mathematically well-behaved (Gaussian rather than Cauchy distribution) is when the signal-to-noise ratio is about 3.  I believe this is why our forefathers used a 3-sigma cutoff.
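
This behaviour is easy to reproduce with a toy Monte Carlo: give every reflection in a synthetic shell the same true signal-to-noise ratio, merge a few observations of each, and watch the shell Rmerge stay stable near the expected relative error at I/sigma around 3 but grow large and erratic below that. All parameters below are invented; smaller or weaker shells behave even worse, because the denominator can approach zero:

import numpy as np

rng = np.random.default_rng(1)

def shell_rmerge(i_over_sig, n_refl=50, mult=4):
    # One synthetic resolution shell: every reflection has true intensity i_over_sig
    # and unit sigma, measured 'mult' times; return the shell Rmerge.
    I = rng.normal(loc=i_over_sig, scale=1.0, size=(n_refl, mult))
    dev = np.abs(I - I.mean(axis=1, keepdims=True)).sum()
    return dev / I.sum()

for snr in (10, 3, 2, 1, 0.5):
    trials = np.array([shell_rmerge(snr) for _ in range(500)])
    lo, hi = np.percentile(trials, [2.5, 97.5])
    print(f"I/sigma = {snr:4}:  median Rmerge = {np.median(trials):6.2f}   95% range = {lo:6.2f} .. {hi:6.2f}")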

Now, a 3-sigma cutoff on the raw observation data may sound like heresy today, and I do NOT recommend you feed such data to refinement or other downstream programs.  But, it is important to remember what you are trying to measure!  If you are trying to detect scaling errors, then you should be looking at spots where scaling errors are not masked by other kinds of error.  For example, a spot with only one photon in it is not going to tell you very much about the accuracy of your scales, but its average |delta-intensity|/intensity is going to be huge.  That is, the pre-R-factor sigma cutoff isolates the R factor calculation to spots dominated by scaling errors.  Including weaker data with their Cauchy-distributed R factor simply adds noise to the value of the R factor itself.

So, I'd say if you have a reviewer complaining that your Rmerge in the outermost bin is too high, simply tell the editor that you did not use a 3-sigma cutoff on the raw data for the Rmerge calculation, and ask if he/she would prefer that you did.

-James Holton
MAD Scientist

----------
From: Randy Read


Hi Frank,

Now that I've been forced to recalibrate my measure of "old literature" (apparently it's not just literature dating from before you started your PhD)...

This idea has been used occasionally, but I think it might be becoming more relevant as more structures are done at low resolution.  To be honest, my criteria for resolution tend to be flexible -- if a crystal diffracts to 2.2A resolution by the I/sig(I)>2 criterion, there's less incentive to worry about it -- pushing it to 2A probably won't make much difference to the biological questions that can be answered.  But if the 2-sigma criterion says that the crystal diffracts to 3.3A, then I'm much more likely to see how much more can justifiably be squeezed out of it.  Actually, I'm more inclined to start from a 1-sigma cutoff in the first instance, trusting the maximum likelihood methods to deal appropriately with the uncertainty.

It's not too hard to do the higher-resolution cross-validation, but there are a number of things to worry about.  First, I remember being told that data processing programs will do a better job of learning the profiles if you only give them data to a resolution where there are real spots, so you probably don't always want to integrate to an arbitrarily high resolution, unless you're willing to go back to the integration after reassessing the resolution cutoff.  Maybe, as a standard protocol, one could integrate to a conservative resolution and a much higher resolution, then use the conservative data set for initial work and the higher resolution data set for evaluating the optimal resolution cutoff -- and then reintegrate one more time later at that resolution, using those data for the rest of the structure determination.

The idea of using the SigmaA curve from Refmac has come up, but SigmaA curves from cross-validation data will have a problem.  In order to get these to behave (with a small number of reflections per resolution bin), you need to smooth the curve in some way.  Refmac does this by fitting a functional form, so the high-resolution SigmaA values are bound to drop off smoothly regardless of the real structure factor agreement.  If you're evaluating resolution immediately after molecular replacement with a good model, then you could use my old SIGMAA program to get independent SigmaA values for individual resolution bins, using all the data (because there's no danger of over-fitting).  However, if you start out with a poor model or solve the structure by experimental phasing, you'll have to do some building and refinement before you have a model good enough to compare with the higher-resolution data.  Then you want to compare the fit of the cross-validation data up to the resolution cutoff used in refinement to the resolution-dependent fit of all the higher resolution data not used in refinement.  I'd probably do that, at the moment, by using sftools to select all the data that haven't been used in refinement then calculate correlation coefficients in resolution bins (which are probably as good for this purpose as SigmaA values).  (For non-aficionados of sftools, the selection could be done by selecting the reflections with d-spacing less than dmin for your refinement, selecting the subset of those that are in the working set, then inverting the selection to get everything not used in refinement.)
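
For those who prefer to script the bookkeeping, the "never used in refinement" selection is just a boolean mask over the reflection list; the column names and free-flag convention below are assumptions, so adapt them to however your data were exported. The resulting mask can then be fed to a binned correlation routine like the one sketched earlier in the thread.

import numpy as np

def cross_validation_mask(d_spacing, free_flag, d_min_refined, free_value=0):
    # Reflections never seen by refinement: everything at higher resolution than the
    # refinement cutoff, plus the free set inside the refined range. The value marking
    # a 'free' reflection varies between conventions, hence the free_value argument.
    d_spacing = np.asarray(d_spacing)
    beyond_cutoff = d_spacing < d_min_refined
    free_inside = (~beyond_cutoff) & (np.asarray(free_flag) == free_value)
    return beyond_cutoff | free_inside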

Regards,

Randy

----------
From: Ronald E Stenkamp


James Holton suggested a reason why the "forefathers" used a 3-sigma cutoff.

I'll give another reason provided to me years ago by one of those guys, Lyle Jensen.  In the 70s we were interested in the effects of data-set thresholds on refinement (Acta Cryst., B31, 1507-1509 (1975)) so he explained his view of the history of "less-than" cutoffs for me.  It was a very Seattle-centric explanation.

In the 50s and 60s, Lyle collected intensity data using an integrating Weissenberg camera and a film densitometer.  Some reflections had intensities below the fog or background level of the film and were labeled "unobserved".  Sometimes they were used in refinement, but only if the calculated Fc values were above the "unobserved" value.

When diffractometers came along with their scintillation counters, there were measured quantities for each reflection (sometimes negative), and Lyle needed some way to compare structures refined with diffractometer data with those obtained using film methods.  Through some method he never explained, a value of 2-sigma(I) defining "less-thans" was deemed comparable to the "unobserved" criterion used for the earlier structures.  His justification for the 2-sigma cutoff was that it allowed him to understand the refinement behavior and R values of these data sets collected with newer technology.

I don't know who all contributed to the idea of a 2-sigma cutoff, nor whether there were theoretical arguments for it.  I suspect the idea of some type of cutoff was discussed at ACA meetings and other places.  And a 2-sigma cutoff might have sprung up independently in many labs.

I think the gradual shift to a 3-sigma cutoff was akin to "grade inflation".  If you could improve your R values with a 2-sigma cutoff, 3-sigma would probably be better.  So people tried it.  It might be interesting to figure out how that was brought under control.  I suspect a few troublesome structures and some persistent editors and referees gradually raised our group consciousness to avoid the use of 3-sigma cutoffs.

Ron

----------
From: Phoebe Rice


I'm enjoying this discussion.
It also seems like a good spot to inject my standard plea for better treatment of anisotropy in things like "table 1" of papers and PDB deposition forms.  When you have Ewald's football (American football), like many nucleic acid-ophiles do, one number simply isn't enough to describe the data.

=====================================
Phoebe A. Rice



----------
From: Navraj S. Pannu


I am concerned about one basic problem with using sigmaa to judge the
quality of the data and, in particular, data cut-offs: sigmaa depends on
a model! Judging data quality with a model that was derived from those very data is worrying.

Sigmaa is by definition the correlation between the (unknown) true structure factor and the calculated structure factor.  A low sigmaa value can mean a poor model or poor data or both, and no one can differentiate between them.  Of course, expertise in looking at maps and their fit to a
model can lead one to (subjectively) conclude that a model is as good as it can get and fully refined, and that a map is to a given resolution, and then
use a sigmaa plot to decide on data quality and cut-offs, but this seems questionable.

And, as alluded to earlier, sigmaa is a parameter that is estimated, and there are many ways of doing so, each of course having its own strengths and weaknesses.  For example, a likelihood function that uses measurement errors can model disagreement between observed and calculated structure factors (in particular, with the large measurement errors that one can/does see at a high resolution limit), and this can lead to an 'optimistic' sigmaa estimate that can be used to incorrectly justify an optimistic data cut-off.  Given all the problems and instability that can occur with sigmaa estimation, I would think that a better estimate of agreement between observed and calculated structure factor amplitudes would be the R-free as a function of resolution, and this can be used together with I/sigma and Rpim/meas to provide insights into the resolution to which model and data agree.  Perhaps instead of talking about data cutoffs, a better way of putting it is: to what resolution are our model and data in agreement?
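
A binned free-set R factor of the kind suggested here is simple to compute once amplitudes, d-spacings and free flags are in arrays; a generic sketch, not tied to any particular refinement program:

import numpy as np

def rfree_by_bin(f_obs, f_calc, d_spacing, is_free, n_bins=15):
    # R = sum|Fo - Fc| / sum Fo over the free reflections only, in resolution bins of
    # equal reflection count, as one view of where model and data stop agreeing.
    f_obs, f_calc, d_spacing, is_free = map(np.asarray, (f_obs, f_calc, d_spacing, is_free))
    fo, fc, d = f_obs[is_free], f_calc[is_free], d_spacing[is_free]
    order = np.argsort(1.0 / d ** 2)
    results = []
    for idx in np.array_split(order, n_bins):
        r = np.abs(fo[idx] - fc[idx]).sum() / fo[idx].sum()
        results.append((d[idx].min(), r))   # (high-resolution edge of bin, free R)
    return results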

