From: Ed Pozharski
Date: 14 October 2011 20:52
This is a follow-up (or a digression) to James comparing the test set to
missing reflections. I have also heard this issue mentioned before but was
always too lazy to actually pursue it.
So.
The role of the test set is to prevent overfitting. Let's say I have
the final model and I monitored the Rfree every step of the way and can
conclude that there is no overfitting. Should I do the final refinement
against the complete dataset?
IMCO, I absolutely should. The test set reflections contain
information, and the "final" model is actually biased towards the
working set. Refining using all the data can only improve the accuracy
of the model, if only slightly.
The second question is practical. Let's say I want to deposit the
results of the refinement against the full dataset as my final model.
Should I not report the Rfree and instead insert a remark explaining the
situation? If I report the Rfree obtained before the test set was merged
back into the working set, it is certain that every validation tool will
report a mismatch. It does not
seem that the PDB has a mechanism to deal with this.
Cheers,
Ed.
--
Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs
----------
From: Nat Echols
You should enter the statistics for the model and data that you actually deposit, not statistics for some other model that you might have had at one point but which the PDB will never see. Not only does refining against the R-free reflections make it impossible to verify and validate your structure, it also means that any time you or anyone else wants to solve an isomorphous structure by MR using your structure as a search model, or continue the refinement with higher-resolution data, you will be starting with a model that has been refined against all reflections. So any future refinements done with that model against isomorphous data are pre-biased, making your model potentially useless.
I'm amazed that anyone is still depositing structures refined against all data, but the PDB does still get a few. The benefit of including those extra 5% of data is always minimal in every paper I've seen that reports such a procedure, and far outweighed by having a reliable and relatively unbiased validation statistic that is preserved in the final deposition. (The situation may be different for very low resolution data, but those structures are a tiny fraction of the PDB.)
-Nat
----------
From: Robbie Joosten
Hi Ed,
Hmm, if your R-free set is small the added value will also be small. If it is relatively big, then your previously established optimal weights may no longer be optimal. A more elegant thing to do would be to refine the model with, say, 20 different 5% R-free sets, deposit the ensemble and report the average R(-free) plus a standard deviation. AFAIK, this is what the R-free set numbers that CCP4's FREERFLAG generates are for. Of course, in that case you should do enough refinement (and perhaps rebuilding) to make sure each R-free set is free.
The deposited R-free sets in the PDB are quite frequently 'unfree' or the wrong set was deposited (checking this is one of the recommendations in the VTF report in Structure). So at the moment you would probably get away with depositing an unfree R-free set ;)
Cheers,
Robbie
----------
From: Quyen Hoang
Sorry, I don't quite understand your reasoning for how the structure is rendered useless if one refined it with all data.
Would your argument also apply to all the structures that were refined before R-free existed?
Quyen
----------
From: Craig A. Bingman
Recent experience indicates that the PDB is checking these statistics very closely for new depositions. The checks made by the PDB are intended to prevent accidents and oversights made by honest people from creeping into the database. "Getting away" with something seems to imply some intention to deceive, and that is much more difficult to detect.
----------
From: Jan Dohnalek
Regarding refinement against all reflections: the main goal of our work is to provide the best possible representation of the experimental data in the form of the structure model. Once the structure building and refinement process is finished, keeping the Rfree set separate does not make sense any more. Its role finishes once the last set of changes has been made to the model and verified ...
J. Dohnalek
Jan Dohnalek, Ph.D
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
16206 Praha 6
Czech Republic
----------
From: Nat Echols
"Useless" was too strong a word (it's Friday, sorry). I guess simulated annealing can address the model-bias issue, but I'm not totally convinced that this solves the problem. And not every crystallographer will run SA every time he/she solves an isomorphous structure, so there's a real danger of misleading future users of the PDB file. The reported R-free, of course, is still meaningless in the context of the deposited model.
Technically, yes - but how many proteins are there whose only representatives in the PDB were refined this way? I suspect very few; in most cases, a more recent model should be available.
-Nat
----------
From: Quyen Hoang
I still don't understand how a structure model refined with all data would negatively affect the determination and/or refinement of an isomorphous structure using a different data set (even without doing SA first).
Quyen
----------
From: Phil Jeffrey
Let's say you have two isomorphous crystals of two different protein-ligand complexes. Same protein different ligand, same xtal form. Conventionally you'd keep the same free set reflections (hkl values) between the two datasets to reduce biasing. However if the first model had been refined against all reflections there is no longer a free set for that model, thus all hkl's have seen the atoms during refinement, and so your R-free in the second complex is initially biased to the model from the first complex. [*]
The tendency is to do less refinement in these sorts of isomorphous cases than in molecular replacement solutions, because the structural changes are usually far less (it is isomorphous after all), so there's a risk that the R-free will not be allowed to fully float free of that initial bias. That makes your R-free look better than it actually is.
This is rather strongly analogous to using different free sets in the two datasets.
However I'm not sure that this is as big of a deal as it is being made to sound. It can be dealt with straightforwardly. However refining against all the data weakens the use of R-free as a validation tool for that particular model so the people that like to judge structures based on a single number (i.e. R-free) are going to be quite put out.
It's also the case that the best model probably *is* the one based on a careful last round of refinement against all data, as long as nothing much changes. That would need to be quantified in some way(s).
Phil Jeffrey
Princeton
[* Your R-free is also initially model-biased in cases where the data are significantly non-isomorphous or you're using two different xtal forms, to varying extents]
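For concreteness, here is a minimal sketch of the bookkeeping Phil describes: carrying the free-set assignment from one isomorphous dataset to the next by Miller index. It assumes the flags are available as a simple (h,k,l) -> flag mapping; real CCP4 tools operate on MTZ files, and the function and names below are only illustrative.

```python
import random

def transfer_free_flags(old_flags, new_hkls, free_fraction=0.05, seed=0):
    """Carry free-R flags from an old isomorphous dataset to a new one.

    old_flags: dict mapping (h, k, l) -> True if the reflection was free.
    new_hkls:  iterable of (h, k, l) present in the new dataset.
    Returns a dict (h, k, l) -> bool for the new dataset.
    """
    rng = random.Random(seed)
    new_flags = {}
    for hkl in new_hkls:
        if hkl in old_flags:
            new_flags[hkl] = old_flags[hkl]                 # keep the old assignment
        else:
            new_flags[hkl] = rng.random() < free_fraction   # new reflection: random flag
    return new_flags

# toy usage
old = {(1, 0, 0): True, (0, 1, 0): False, (0, 0, 1): False}
new = [(1, 0, 0), (0, 1, 0), (0, 0, 2)]
print(transfer_free_flags(old, new))
```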
----------
From: Felix Frolow
Recently we (I mean WE - community) frequently refine structures around 1 Angstrom resolution.
This is not what Rfree was invented for. It was invented to get away with 3.0-2.8 Angstrom data
in times when people did not possess facilities good enough to look at the electron density maps….
We finish (WE - I again mean - community) the refinement of our structures too early.
Dr Felix Frolow
Professor of Structural Biology and Biotechnology
Department of Molecular Microbiology
and Biotechnology
Tel Aviv University 69978, Israel
Acta Crystallographica F, co-editor
----------
From: Ed Pozharski
If you read my post carefully, you'll see that I never suggested
reporting statistics for one model and depositing the other.
Frankly, I think you are exaggerating the magnitude of model bias in the
situation that I described. You assume that the refinement will become
severely unstable after tossing in the test reflections. Depending on
the resolution etc., the rms shift of the model may vary, but even if it
were, say, half an angstrom, the model would hardly become useless (and that
is a huge overestimate). And at least in theory including *all the data*
should make the model more, not less accurate.
And so, probably, is the benefit of excluding them once all the steps that
require cross-validation have already been performed. My thinking is
that excluding data from analysis should always be justified (and in the
initial stages of refinement it might be, as it prevents overfitting),
not the other way around.
Cheers,
Ed.
--
"Hurry up before we all come back to our senses!"
Julian, King of Lemurs
----------
From: Craig A. Bingman
We have obligations that extend beyond simply presenting a "best" model.
In an ideal world, the PDB would accept two coordinate sets and two sets of statistics, one for the last step where the cross-validation set was valid, and a final model refined against all the data. Until there is a clear way to do that, and an unambiguous presentation of them to the public, IMO, the gains won by refinement against all the data are outweighed by the confusion that it can cause when presenting model and associated statistics to the public.
----------
From: Quyen Hoang
Thanks for the clear explanation. I understood that.
But I was trying to understand how this would negatively affect the initial model to render it useless or less useful.
In the scenario that you presented, I would expect a better result (better model) if the initial model was refined with all data, thus more useful.
Sure, again in your scenario, the "new" structure has seen R-free reflections at the equivalent indices of its replacement model, but their intensities should be different anyway, so I am not sure how this is bad. Even if the bias is huge, let's say this bias results in a 1% reduction in initial R-free (exaggerating here), how would this make one's model bad or how would this be bad for one's science?
In the end, our objective is to build the best model possible and I think that more data would likely result in a better model, not the other way around. If we can agree that refining a model with all data would result in a better model, then wouldn't not doing so constitute a compromise of model quality for a more "pure" statistic?
I had not refined a model with all data before (just to keep in line), but I wondered if I was doing the best thing.
Cheers,
Quyen
----------
From: Ethan Merritt
A model with error bars is more useful than a marginally more
accurate model without error bars, not least because you are probably
taking it on faith that the second model is "more accurate".
Crystallographers were kind of late in realizing that a cross validation
test could be useful in assessing refinement. What's more, we
never really learned the whole lesson. Rather than using the full
test, we use only one blade of the jackknife.
http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation
The full test would involve running multiple parallel refinements,
each one omitting a different disjoint set of reflections.
The ccp4 suite is set up to do this, since Rfree flags by default run
from 0-19 and refmac lets you specify which 5% subset is to be omitted
from the current run. Of course, evaluating the end point becomes more
complex than looking at a single number "Rfree".
Surely someone must have done this! But I can't recall ever reading
an analysis of such a refinement protocol.
Does anyone know of relevant reports in the literature?
Is there a program or script that will collect K-fold parallel output
models and their residuals to generate a net indicator of model quality?
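One possible shape for the collection step, assuming the K parallel refinements (e.g. with the free flags 0-19 mentioned above) have already been run and their per-fold Rwork/Rfree and test-set sizes have been extracted from the logs. The pooled figure below is just one reasonable way to combine the folds, not an established convention.

```python
from statistics import mean, stdev

def kfold_summary(folds):
    """folds: list of dicts with keys 'n_test', 'r_free', 'r_work' (one per fold)."""
    rfree = [f["r_free"] for f in folds]
    rwork = [f["r_work"] for f in folds]
    n_total = sum(f["n_test"] for f in folds)
    # pooled Rfree: weight each fold by its number of test reflections
    # (an approximation to pooling the per-reflection residuals directly)
    pooled = sum(f["r_free"] * f["n_test"] for f in folds) / n_total
    return {
        "n_folds": len(folds),
        "mean_r_free": mean(rfree),
        "sd_r_free": stdev(rfree) if len(rfree) > 1 else 0.0,
        "pooled_r_free": pooled,
        "mean_r_work": mean(rwork),
    }

# toy usage with made-up numbers for three folds
folds = [
    {"n_test": 950, "r_free": 0.192, "r_work": 0.157},
    {"n_test": 948, "r_free": 0.188, "r_work": 0.158},
    {"n_test": 952, "r_free": 0.201, "r_work": 0.156},
]
print(kfold_summary(folds))
```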
Ethan
--
Ethan A Merritt
Biomolecular Structure Center, K-428 Health Sciences Bldg
University of Washington, Seattle 98195-7742
----------
From: Phil Evans
I just tried refining a "finished" structure turning off the FreeR set, in Refmac, and I have to say I can barely see any difference between the two sets of coordinates.
From this n=1 trial, I can't see that it improves the model significantly, nor that it ruins the model irretrievably for future purposes.
I suspect we worry too much about these things
Phil Evans
----------
From: Thomas C. Terwilliger
For those who have strong opinions on what data should be deposited...
The IUCR is just starting a serious discussion of this subject. Two
committees, the "Data Deposition Working Group", led by John Helliwell,
and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su)
are working on this.
Two key issues are (1) feasibility and importance of deposition of raw
images and (2) deposition of sufficient information to fully reproduce the
crystallographic analysis.
I am on both committees and would be happy to hear your ideas (off-list).
I am sure the other members of the committees would welcome your thoughts
as well.
-Tom T
Tom Terwilliger
----------
From: Gerard Bricogne
Dear Tom,
I am not sure that I feel happy with your invitation that views on such
crucial matters as these deposition issues be communicated to you off-list.
It would seem much healthier if these views were aired out within the BB.
Again!, some will say ... but the difference is that there is now a forum
for them, set up by the IUCr, that may eventually turn opinions into some
form of action.
I am sure that many subscribers to this BB, and not just you as a
member of some committees, would be interested to hear the full variety of
views on the desirable and the feasible in these areas, and to express their
own for everyone to read and discuss.
Perhaps John Helliwell can elaborate on this and on the newly created
forum.
With best wishes,
Gerard.
--
Gerard Bricogne
Global Phasing Ltd.
Sheraton House, Castle Park
Cambridge CB3 0AX, UK
----------
From: D Bonsor
I may be missing something (someone please point out if I am wrong, and why, as I am curious), but with a highly redundant dataset wouldn't the difference from refining the final model against the full dataset be small, given the random selection of reflections for Rfree?
----------
From: Thomas C. Terwilliger
Dear Gerard,
I'm very happy for the discussion to be on the CCP4 list (or on the IUCR
forums, or both). I was only trying to not create too much traffic.
All the best,
Tom T
----------
From: Edward A. Berry
Now it would be interesting to refine this structure to convergence,
with the original free set. If I understood correctly Ian Tickle has
done essentially this, and the Free R returns essentially to its
original value: the minimum arrived at is independent of starting
point, perhaps within the limitation that one might get caught in a
different false minimum (which is unlikely given the minuscule changes
you see). If that is the case we should stop worrying about
"corrupting" the free set by refining against it or even using it
to make maps in which models will be adjusted.
This is a perennial discussion but I never saw the report that
in fact original free-R is _not_ recoverable by refining to
convergence.
Indeed, perhaps we worry too much about such things.
----------
From: James Stroud
Each R-free flag corresponds to a particular HKL index. Redundancy refers to the number of times a reflection corresponding to a given HKL index is observed. The final structure factor of a given HKL can be thought of as an average of these redundant observations.
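A toy illustration of this point, with made-up numbers: many redundant observations collapse to one unique reflection per HKL, and it is to that unique reflection that a free flag is (or is not) attached, so high redundancy does not add free reflections.

```python
from collections import defaultdict

# (h, k, l, intensity): several measurements of the same Miller index
observations = [
    (1, 2, 3, 100.0), (1, 2, 3, 98.0), (1, 2, 3, 103.0),
    (2, 0, 0, 50.0), (2, 0, 0, 52.0),
]

merged = defaultdict(list)
for h, k, l, i in observations:
    merged[(h, k, l)].append(i)

# one merged value per unique HKL; free flags would be assigned per key here
unique = {hkl: sum(vals) / len(vals) for hkl, vals in merged.items()}
print(len(observations), "observations ->", len(unique), "unique reflections")
```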
Related to your question, someone once mentioned that for each particular space group, there should be a preferred R-free assignment. As far as I know, nothing tangible ever came of that idea.
James
----------
From: Ed Pozharski
The amplitude of the shift, I presume, depends on the resolution and
data quality. With a very good 1.2A dataset refined with anisotropic
B-factors to R~14%, what I see is a ~0.005A rms shift. That is not much;
however, the reported ML DPI is ~0.02A, so perhaps the effect is not that
small compared to the precision of the model.
On the other hand, the more "normal" example at 1.7A (and very good data
refining down to R~15%) shows ~0.03A general variation with a variable
test set. Again, not much, but the ML DPI in this case is ~0.06A -
comparable to the variation induced by the choice of the test set.
Cheers,
Ed.
--
Hurry up, before we all come back to our senses!
Julian, King of Lemurs
----------
From: Pavel Afonine
Hi,
yes, shifts depend on resolution indeed. See pages 75-77 here:
Pavel
----------
From: Nicholas M Glykos
Dear Ethan, List,
Total statistical cross validation is indeed what we should be doing, but
for large structures the computational cost may be significant. In the
absence of total statistical cross validation the reported Rfree may be an
'outlier' (with respect to the distribution of the Rfree values that would
have been obtained from all disjoint sets). To tackle this, we usually
resort to the following ad hoc procedure :
At an early stage of the positional refinement, we use a shell script
which (a) uses Phil's PDBSET with the NOISE keyword to randomly shift
atomic positions, (b) refines the resulting models with each of the
different free sets to completion, (c) calculates the mean of the resulting
free R values, and (d) selects (once and for all) the free set which is
closest to the mean of the Rfree values obtained above.
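Step (d) amounts to something like the following, assuming steps (a)-(c) have already produced the converged Rfree for each candidate free set (the values below are made up):

```python
def pick_free_set(rfree_by_set):
    """rfree_by_set: dict mapping free-set index -> converged Rfree for that set.
    Returns the set whose Rfree lies closest to the mean, plus the mean itself."""
    target = sum(rfree_by_set.values()) / len(rfree_by_set)
    best = min(rfree_by_set, key=lambda s: abs(rfree_by_set[s] - target))
    return best, target

# toy usage with made-up Rfree values for five candidate sets
rfree_by_set = {0: 0.231, 1: 0.248, 2: 0.239, 3: 0.262, 4: 0.244}
chosen, mean_rfree = pick_free_set(rfree_by_set)
print(f"mean Rfree {mean_rfree:.3f}; choose set {chosen}")
```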
For structures with a small number of reflections, the statistical noise
in the 5% sets can be very significant indeed. We have seen differences
between Rfree values obtained from different sets reaching up to 4%.
Ideally, instead of PDBSET+REFMAC we should have been using simulated
annealing (without positional refinement), but moving continuously between
CNS/X-PLOR and CCP4 was too much for my laziness.
All the best,
Nicholas
--
Dr Nicholas M. Glykos, Department of Molecular Biology
and Genetics, Democritus University of Thrace, University Campus,
Dragana, 68100 Alexandroupolis, Greece,
----------
From: Anastassis Perrakis
This is very intriguing indeed!
Is there something specific about these structures such that the Rfree differences,
depending on the set used, reach 4%? NCS? Or the 5% set having fewer than ~1000-1500 reflections?
It would be indeed very interesting if there was a correlation there!
A.
----------
From: Nicholas M Glykos
Tassos, by your standards, these structures should have been described as
'tiny' and not small ... ;-) [Yes, significantly fewer than 1000. In one
case the _total_ number of reflections was 5132 (which were,
nevertheless, slowly and meticulously measured by a CAD4 one-by-one. These
were the days ... :-)) ].
----------
From: Boaz Shaanan
Just a naive question: isn't everything we do in refinement resolution-dependent?
Boaz
Boaz Shaanan, Ph.D.
Dept. of Life Sciences
Ben-Gurion University of the Negev
Beer-Sheva 84105
Israel
----------
From: Ed Pozharski
This produces a curious paradox.
One possible reason for the variation in Rfree when choosing a different
test set is that by pure chance reflections with more/less noise can be
selected. Which automatically means that the working set contains
reflections with less/more noise and therefore the model (presumably)
gets better/worse. So, selecting a test set that results in lower Rfree
leads to a model which is likely worse?
In fact, an obvious way to improve the Rfree through choice of a better
test set is by biasing it towards stronger reflections in each
resolution shell.
Selecting a test set that minimizes Rfree is so wrong on so many levels.
Unless, of course, the only thing I know about Rfree is that it is the
magic number that I need to make small by all means necessary.
----------
From: Pavel Afonine
Hi,
this is in line with my observations too.
Not surprising at all, though (see my previous post on this subject): a small, seemingly insignificant change somewhere may result in refinement taking a different pathway, leading to a different local minimum. There is even a way of making practical use of this (Rice, Shamoo & Brunger, 1998; Korostelev, Laurberg & Noller, 2009; ...).
This "seemingly insignificant change somewhere" may be:
- what Ed mentioned (different noise level in free reflections or simply different strength of reflections in free set between sets);
- slightly different starting conditions (starting parameter values);
- the random seed used in X-ray/restraints target weight calculation (applies to phenix.refine);
- I can go on; there are 10+ possibilities.
I do not know whether choosing the result with the lowest Rfree is a good idea or not (after reading Ed's post I am slightly puzzled now), but what's definitely a good idea in my opinion is to know the range of possible R-factor values in your specific case, so you know which difference between two R-factors obtained in two refinement runs is significant and which one is not.
Pavel
----------
From: Tim Gruene
Dear Nicholas,
for a data set with 5132 unique reflections you should flag 10.5% for
Rfree; otherwise you might as well drop Rfree completely and use the
whole data set for refinement. At least this is how I understand Axel
Brunger's article about Rfree, where he states that one needs 500-1000
reflections for Rfree to be statistically meaningful.
I have wondered where the '5% rule' came from, since it compromises the Rfree
for low-resolution data sets (especially with high symmetry).
If Axel Brunger's initial statement has become obsolete I would
appreciate some clarification on the required number of flagged
reflections, but until then I will keep on flagging 500-1000 reflections,
rather than 5%.
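The arithmetic behind this, for the dataset size mentioned above:

```python
# How big a fraction is needed to get 500-1000 free reflections
# from a dataset with 5132 unique reflections?
n_unique = 5132
print("5% of the data:", round(0.05 * n_unique), "free reflections")
for target in (500, 1000):
    print(f"fraction needed for {target} free reflections: {target / n_unique:.1%}")
```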
Tim
--
Dr Tim Gruene
Institut fuer anorganische Chemie
Tammannstr. 4
D-37077 Goettingen
----------
From: John R Helliwell
Dear Gerard,Tom and Bernhard,
Thank you for highlighting the IUCr Diffraction Data Deposition Working
Group and Forum.
Dear Colleagues,
I am travelling at present and apologise for not replying sooner to
the CCP4bb; I also have intermittent email access until later
this week when I 'return to office'.
The points being raised in this CCP4bb thread are very important and
the IUCr also recognises this.
The role of the IUCr Working Group that has been set up is to bring to
a focus information and to identify steps forward. We seek to make
progress towards archiving and making available all relevant
scientific data associated with a publication (or a completed
structure deposition in a validated database such as the PDB). The
consultation process is being formalised via the IUCr Forum pages. The
Working Group and a wider Group consisting of IUCr Commissions and
consultants have been established for discussion and planning. We are
also aiming at a community consultation via the Forum approach and we
will launch the Forum for this asap.
The IUCr invites the widest possible input, from the various
communities that the IUCr Commissions serve, on the future of diffraction
data deposition, which can surely be improved. Thus this Forum will
help to record an organised set of inputs for future reference.
The Forum is being set up and will require registration, which is a
straightforward process. Details will follow shortly.
Members of the Working Group and its consulted representatives are listed below.
Best wishes and regards,
Yours sincerely,
John
Prof John R Helliwell DSc
Chairman of the IUCr Diffraction Data Deposition Working Group (IUCr DDD WG).
IUCr DDD WG Members
Steve Androulakis (TARDIS representative)
John R. Helliwell (Chair) (IUCr ICSTI Representative; Chairman of the
IUCr Journals Commission 1996-2005)
Loes Kroon-Batenburg (Data processing software)
Brian McMahon (IUCr CODATA Representative)
John Westbrook (wwPDB representative and COMCIFS)
Sol Gruner (Diffuse scattering specialist and SR Facility Director)
Heinz-Josef Weyer (SR and Neutron Facility user)
Tom Terwilliger (Macromolecular Crystallography)
Consultants:
Alun Ashton (Diamond Light Source (DLS); Data Archive leader there)
Herbert Bernstein (Head of the imgCIF Dictionary Maintenance Group and
member of COMCIFS)
Frances Bernstein (Observer on data deposition policies)
Gerard Bricogne (Active software and methods developer)
Bernhard Rupp ( Macromolecular crystallographer)
IUCr Commissions (Chairs and/or alternates).
Professor John R Helliwell DSc
----------
From: Thomas C. Terwilliger
I think that we are using the test set for many things:
1. Determining and communicating to others whether our overall procedure
is overfitting the data.
2. Identifying the optimal overall procedure in cases where very different
options are being considered (e.g., should I use TLS).
3. Calculating specific parameters (e.g., sigmaA).
4. Identifying the "best" set of overall parameters.
I would suggest that we should generally restrict our usage of the test
set to purposes #1-3. Given a particular overall procedure for
refinement, a very good set of parameters should be obtainable from the
working set of data.
In particular, approaches in which many parameters (in the limit... all
parameters) are fit to minimize Rfree do not seem likely to produce the
best model overall. It might be worth doing some experiments with the
super-free set approach to determine whether this is true.
----------
From: Pavel Afonine
Yes, Rsleep seems to be just the right thing to use for this:
Separating model optimization and model validation in statistical cross-validation as applied to crystallography
G. J. Kleywegt
Acta Cryst. (2007). D63, 939-940
Practically, it would mean that we split off 10% of the reflections as test reflections: 5% used for optimizations like #1-4, and the other 5% (the sleep set) never ever used for anything. The big question here is whether this will make any important difference. I suspect, as with many similar things, there will be no clear-cut answer (that is, it may or may not make a difference, case dependent).
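A minimal sketch of that partitioning, with illustrative three-way labels (not any program's actual flag convention):

```python
import random

def assign_rsleep_flags(hkls, free_frac=0.05, sleep_frac=0.05, seed=0):
    """Split reflections into ~90% 'work', 5% 'free' (used for Rfree and
    decisions #1-4) and 5% 'sleep' (never looked at until the very end)."""
    rng = random.Random(seed)
    flags = {}
    for hkl in hkls:
        x = rng.random()
        if x < sleep_frac:
            flags[hkl] = "sleep"
        elif x < sleep_frac + free_frac:
            flags[hkl] = "free"
        else:
            flags[hkl] = "work"
    return flags

# toy usage on a dummy list of Miller indices
hkls = [(h, k, l) for h in range(10) for k in range(10) for l in range(10)]
flags = assign_rsleep_flags(hkls)
print({v: list(flags.values()).count(v) for v in ("work", "free", "sleep")})
```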
Pavel
----------
From: Ed Pozharski
By using a simple genetic algorithm, I managed to get Rfree for a
well-refined model as low as 14.6% and as high as 19.1%. The dataset is
not too small (~40,000 reflections in all, with a standard-sized 5% test
set). So you can get a spread as wide as 4.5% even with a not-so-small
dataset. Only ~1/3 of test reflections are exchanged to achieve this.
What's curious is that, contrary to my expectations, the test set
remains well distributed throughout resolution shells upon this awful
"optimization" and the <F/sigF> for the working set and test set remain
close. Not sure how to judge which model is actually better, but it's
noteworthy that the FOM gets worse for *both* upward and downward
"optimization" of the test set.
--
After much deep and profound brain things inside my head,
I have decided to thank you for bringing peace to our home.
Julian, King of Lemurs