CCP4 Bulletin Board Archive: raw data deposition

From: Ed Pozharski
Date: 27 October 2011 17:08

I am curious as to what the collective opinion on the raw data
deposition actually is across the cross-section of the macromolecular
crystallography community subscribed to the bb. So, if you have a
second and a formed opinion on the idea of the depositions of the raw
data, please vote here

http://tinyurl.com/3qlwwsh

I'll post the results as soon as they look settled.

Cheers,

Ed.

--
"Hurry up before we all come back to our senses!"
Julian, King of Lemurs

----------
From: Gerard Bricogne

Dear Ed,

I am really puzzled by this initiative. It assumes that there is a
pre-formed "collective opinion" out there, independent from and unaffected
by the exchanges of views that have taken place on this BB, that would be
worth more than the conclusions we might reach by pursuing these exchanges.

The thread you are obviously deciding to dissociate yourself from was
initiated in response to a suggestion that views on this topic would
usefully be aired publicly on this BB rather than posted off-list to Tom
Terwilliger, who immediately agreed that this was a good idea and has been
very supportive of this discussion.

Shouldn't we continue to try and put our heads together to reach a
consensus, rather than collect opinions that may be little more than prior
prejudices?

What shall we gain by such a vote? I may be misunderstanding what you
have in mind, of course :-) .

With best wishes,

Gerard.

----------
From: Jacob Keller

One thing that the poll is useful for is something I find surprising:
~40% when I checked found storing images a waste of time. So, I guess
this might be useful for finding the "silent [significant] minority."
Why not have those folks chime in about why they think this is
useless, even to store images of solved datasets?

JPK

--
*******************************************
Jacob Pearson Keller
Northwestern University
Medical Scientist Training Program

----------
From: Susan Lea

I think the key is that the question asks "is a waste of money".
In a straitened funding time it may just be that storing the raw images in addition to the processed
data doesn't float to the top of the list of "things that must be done whatever else happens in science".

Something can be desirable but just not come above a funding barrier.

Susan

Prof. Susan M. Lea

----------
From: Ethan Merritt

That kind of misses the point that the images may be of more value
to others (software developers, TDS wonks, Gloria's incommensurate
lattice hunt, ...) than they are to the person who originally collected them.

So "I don't need them" is different from "there is no point in saving them".

Ethan

--
Ethan A Merritt

----------
From: Gerard Bricogne

Dear Jacob,

I agree, of course, with the goal of giving everyone a voice, but
knowing that 40% of the voters find storing images a waste of time falls
short of knowing why they think so and taking their arguments into account.
Disagreeing without saying why when a topic is being actively discussed is a
position that does not contribute anything very constructive.

I think there should be an extra category of answer that would be
"I don't care", so that people who have no opinion do not get confused with
those who have an articulate position against the proposal, and who should
then articulate it!

----------
From: Jacob Keller

In medical school, I found out that there could be a large population
in a class which was completely lost or completely disagreed with what
was being said, but there was only silence. When the lecturer would
pose a question, it would take a painful silence before anyone in the
100+ student class would hazard an answer (one can only speculate why
this was, but so be it.) Even a yes-no show of hands would yield only
~10% participation. Too much commitment? Anyway, the solution for many
lecturers was to pass out "clickers," which allowed students to vote
anonymously for a given answer in a multiple choice question. Here we
have Ed's version of clickers for the ccp4bb. The problem is that
clickers are not very articulate--they just click!

I fully agree that those who disagree should articulate their
opinions--maybe subscribe anonymously if necessary? I don't like it
that people are so stymied, but this seems to be the way things are,
so the question is how to work with it.

Jacob

----------
From: Garib N Murshudov

I never thought that science should be done democratically. (Note, I voted to see results. Otherwise results are invisible). It would be unimaginable to decide by majority vote that a particular equation or theory is valid (e.g. relativity theory). I thought that storing data is a scientific question and should be tackled scientifically. You provide evidence, proof or proof of principle.

The most important question is repeatability of the experiment. Question is: how far should we go? I know that there is at least one case of overmerged data in the pdb. This particular question could be solved (only partially) if you deposit unmerged data; with images it is solved completely. Overmerging means averaging structures, thus losing differences between them (biologically important or not). Overmerging could be over translation (superlattice), rotation (higher space group) or both.

Has anybody ever done systematic analysis of pdb (even better data sets collected on one of the synchrotrons) to see the seriousness of the problem? I suspect the problem is much more serious than it is perceived.

Before you provide sufficient evidence everybody will have their opinion.

Garib

Garib N Murshudov
Structural Studies Division

MRC Laboratory of Molecular Biology

----------
From: Ed Pozharski

Dear Garib,

I am afraid clarification is in order.

Firstly, the results are available here

https://docs.google.com/spreadsheet/ccc?key=0Ahe0ET6Vsx-kdHh4cjdLZGZrSEpUOG9kV2hkb3ZXNHc

Click Form->Show summary to see the pie chart. This is so you don't
need to vote again to see the results (and please, don't vote more than
once anyway!). In my past experience, the results get more or less
final in a day or two or once the number of responses reaches ~300.

Secondly, it was not my intent to provide a "democracy-based argument".
Majority is often wrong.

Thirdly, it was not my intent to bias the results by carefully crafting
misleading/confusing options. Just disregard the part past "No". Or
provide your own reasons using the "Other" - I personally find that
category the most interesting.

Fourthly, my intent was to separate the discussion of "how to do it"
from "should we do it". I disagree with Garib somewhat that this is
a purely scientific question, and perhaps it is open to some opinion. The
proposed changes will affect everyone (albeit in a minor way), and my
ultimate intent is not to impose democracy but rather, as Jacob pointed
out, to potentially give voice to the silent faction. Garib is right
that we should approach the question scientifically, but it's important
to know if the issue is at all controversial. (In a strange way, the
smaller the minority is on either side the more important it seems to me
personally that every effort is made to assure that its position is well
understood).

Hope this clarifies things,

Ed.

--
Edwin Pozharski,
----------------------------------------------
When the Way is forgotten duty and justice appear;
Then knowledge and wisdom are born along with hypocrisy.
When harmonious relationships dissolve then respect and devotion arise;
When a nation falls to chaos then loyalty and patriotism are born.
------------------------------ / Lao Tse /

----------
From: Ed Pozharski

Sorry, the results in a pie-chart form are available here (but the
spreadsheet may be useful too if you want to see what is meant by
"other")

https://docs.google.com/spreadsheet/viewanalytics?hl=en_US&formkey=dHh4cjdLZGZrSEpUOG9kV2hkb3ZXNHc6MQ

--
Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs

----------
From: Adrian Goldman

Um, I have thought about entering this thread at least a dozen times. I've started several comments and stopped all of them.

First, I am with the silent majority who doesn't think this data storage is a good idea (or not a good enough idea) but who hasn't responded till now. And let me say that, as this bb hardly reaches ALL practicing MM crystallographers, but only those with an interest in techniques, the results AND discussion are heavily skewed in favor of storage. At least that's what I think.

So - looking at my own navel - why would one, did I, not write until now? There is in the bb a loud active (and my guess) minority whose opinions are already formed, so responding seems pointless. It won't change anything and will just lead to opprobrium pouring down on my head. That's one reason.

But let me say - and I voted 'no' as should be blindingly obvious - two more things.
1) this is not a matter of science, but science (internal) policy, and so the majority actually SHOULD count.
2) I agree with Susan. In a time of limited funding, is this the most important use of money?
This point was made in a News and Views I recently read but cannot find despite an hour of searching - we as a species are not good at judging the opportunity costs implicit in choices. There are plenty implicit in this choice: would it not, for instance, be MUCH more useful to finally get the modellers to release their source code?

But enough of the nattering nabobs of negativism! As such frame information is so valuable for future development efforts, I think all it would require would be an email to a local crystallographer working on an impossible problem, and I am sure it would be forthcoming. For s/w development purposes, I can't believe that even a small fraction of the terabytes of frame data off the Pilatus detectors is needed...

Adrian Goldman

Sent from my iPad

----------
From: Craig A. Bingman

I strongly suspect that it is much more cost effective to have the PDB archive a unit of data than it is to have it archived at the lab or department level. So I suspect that more money will be available for doing science if we turn over archival responsibilities for image data to the kind folks at the PDB, who really know what they are doing and capture efficiencies of scale that are out of reach for most labs.

----------
From: D Bonsor

Why should we store images?

From most of the posts it seems to aid in software development. If that is the case, there should be a Failed Protein Databank (FPDB) where people could upload datasets which they cannot solve. This would aid software development and allow someone else to have a go at solving the structure.

If it is for historical reasons, how can someone decide whether their structure is historical? I would propose that images should be uploaded for a protein or protein-complex that has never been solved before. That way the images are there if that structure does become historical.

The question is not whether or not images should be uploaded but who would use the images that were uploaded.

For example, people who use crystallography as a tool to aid in characterization of their protein, would probably not look at images for 99.5% of other protein datasets, and they probably would not look at images for a protein that is related to their own protein. They are more interested in the final structure. I too would probably not be interested in reprocessing and solving a structure again when I can easily access the final product already.

----------
From: Nat Echols

It's worth keeping in mind that there was once strong opposition to the current rules on PDB deposition - the best example I could find is here:

http://www.nature.com/nsmb/journal/v5/n6/pdf/nsb0698-407.pdf

Notably, nearly a third of scientists polled thought they should be allowed to publish without releasing coordinates. If this had been a majority, should the journal editors have meekly submitted and allowed the old policy of 1-year holds to continue? Admittedly, the issue of archiving raw images is not the same, since they are of much less use to the community, but it's a good example of why some opinions should be ignored.

-Nat

----------
From: Ed Pozharski

This is my response to Gerard, originally off-list, but which he feels
needs to be made public.

Dear Gerard,

1. I think any opinion (collective or individual) by now is affected by
the ongoing discussion.
2. I am not sure how this would make the discussion less public.
3. Yes, we should continue to seek consensus, but perhaps it may be
useful to see if the consensus already exists. Or if the proposition of
storing the raw data (which I personally support, but haven't always)
faces strong headwinds.
4. Any online poll is always skewed towards the people who care enough
about the issue. My concern was that I don't really know how many
people support this. Granted, the right decision is not always
supported by the majority and "democracy" has its obvious limits, but
what we gain by this is some idea as to where people actually stand.

My hope is that the poll would show that most people support the
deposition of raw images. This should presumably help your argument
(which, again, I wholeheartedly support) that this has to be done. If
it shows the opposite... well, then we have the work to do to convince
them. And perhaps listen to what their arguments are.

I think there are two questions regarding the raw data deposition (not
necessarily in that order):

1. How to do it.

That is what the other thread is dealing with and my overall feeling is
that difficulties have been largely exaggerated early on. You are right
that concrete steps can be taken.

2. Do we need to do it.

To me, it's no-brainer, but some responses seem to suggest not everyone
is really on board. Again, I am sure this has to be done, but consensus
in this area is equally important.

HTH,

Ed.

----------
From: Thomas C. Terwilliger

There are many reasons why storing images can be useful, but one is the
ability to re-analyze the data for a structure, or for all structures, in
a systematic and improved way.

I imagine that in a few years the PDB-REDO approach to rebuilding
structures will be extended to complete redetermination of all structures
on a regular basis. The resulting structures will continuously improve,
and each new redetermination will be stored so that a static view can be
referenced.

The data for these structures will remain constant, the interpretation
will change (presumably an improvement).

The logical continuation of this approach will be to move back from merged
data to unmerged data, and then to raw images. Surely we will develop
improved methods for analysis of images, and structures (or perhaps
details of structures) will improve.

Surely also some structures that were determined with less than optimal
care today will become accurate structures in this way.

-Tom T

----------
From: Michel Fodje

We store raw data for two main reasons:
a) We currently use only a fraction of the information actually contained in raw images and extraction of that fraction can be improved. Destroying the data means
- we lose the extra information, and make future research in some areas either impossible or more costly
- we make it more difficult to improve current data reduction methods
b) Raw data is the best way to independently validate a published structure and prevent fraud.

The majority of crystallographers already recognize these truths. That is why almost all of them do keep backups of their data even after structures have been published.

To those still against making data public I would ask a simple question: Would you object to providing the raw data from a published structure if such data were available and you did not have to bear an unreasonable inconvenience in the process? My guess is that most crystallographers are reasonable scientists and such a "Poll" will probably result in ~100% "Yes" and ~0% "No". Am I wrong?

The real issue then is how do we make the data available in such a way that the inconvenience (if any) to all the stake-holders is reasonable. Some great ideas have already been advanced.

In the short-term, we could start by using the fact that synchrotron facilities already store raw data for a period. However, a lot of data is collected which is not published. Given the limited disk space, it may be useful to know exactly which datasets result in a publication and should be kept for an extended period. If a unique ID (such as the DOI suggestion) is provided to every dataset and required during deposition/publication, then synchrotron facilities can preserve only those datasets which have been published after a given "grace" period. Combined with a central Meta-data server similar to TARDIS, such a system could be developed in a relatively short period of time, while longer term central storage ideas are worked out.
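
The retention rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Dataset` record, the `should_retain` and `purge_candidates` helpers, the identifier format, and the five-year grace period are all hypothetical and do not correspond to any existing facility's system or to TARDIS.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical grace period; the real value would be a facility policy decision.
GRACE_PERIOD = timedelta(days=5 * 365)


@dataclass
class Dataset:
    dataset_id: str                   # unique ID (e.g. DOI-like); format is assumed
    collected_on: date                # date the images were collected
    pdb_entry: Optional[str] = None   # set once a deposition cites this dataset


def should_retain(ds: Dataset, today: date,
                  grace: timedelta = GRACE_PERIOD) -> bool:
    """Keep a dataset if it has been published or is still within the grace period."""
    if ds.pdb_entry is not None:
        return True
    return today - ds.collected_on <= grace


def purge_candidates(datasets: List[Dataset], today: date,
                     grace: timedelta = GRACE_PERIOD) -> List[str]:
    """Return IDs of unpublished datasets whose grace period has expired."""
    return [ds.dataset_id for ds in datasets
            if not should_retain(ds, today, grace)]
```

In such a scheme the facility would only need to learn, at deposition time, which dataset ID a PDB entry refers to; everything else is bookkeeping on its side.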

Again the best solution is going to be one which requires the least amount of effort from crystallographers. In fact, I can see a system in which the experiment metadata for a PDB entry/dataset comes directly from the synchrotron facility during deposition so that users simply provide a unique dataset ID and the experimental details are pre-filled for them.

Of course the above completely ignores home sources.

/Michel

----------
From: Jacob Keller

Since this hasn't been brought up--there is the consideration that in
10 or more years maybe x-ray crystallography will be completely a
thing of the past, with some kind of massively-superior modality
taking over. Of course there is no way to bank on this, but I am
wondering whether this is something to consider or not. Do we really
think people will still be crystallizing proteins in 50 years, or far
less, looking up structures determined in 2011? Has anybody recently
used the original myoglobin structure?

JPK

----------
From: Ed Pozharski

Dear Adrian,

thank you - this is most helpful in assessing why we do or don't need to
deposit the raw data.

However:

Fair enough. I am afraid we need another survey to settle this (oh,
no!). But given the variety of questions that are asked on this bb it
is my expectation that the subset is quite representative. I also
thought that "protein crystallographer" implies interest in techniques.
If those with interest in techniques are nowadays a minority... well,
let's just say it explains a lot.

I can't, of course, pretend to be the spokesperson of the loud minority
(but it's surely true that I am occasionally loud and obnoxious). To
summarize my personal feelings about the issue you raise, let me quote
Voltaire:
"I do not agree with what you have to say, but I'll defend to the death
your right to say it."

Agreed with the caveat that the majority could be wrong.

This is an important point, but I suspect that a) most of the task can
be accomplished within the existing framework (thus no extra personnel
costs) and b) the extra storage is really, really cheap these days -
even if I store all the data we collect, it is probably still less than
dewar shipping costs.

IMHO, this is not about developers getting data to work with (I am sure
they already have plenty). It's about extending the retro-processing
concept pioneered by PDB-REDO to the integration/scaling. And yes, it
is about preventing wishful overinterpretation. While this is not about
raw data processing, will correcting someone fitting an alpha helix
with multiple di-EG molecules change the course of history? Of course
not. It won't even change the main finding of that paper. But I always
believed in getting the best structure possible under the circumstances
(and, of course, failed to live up to that standard).

But I digress, as usual. Once again - thank you, I think it's very
important that these issues are discussed. If the raw data deposition
is made mandatory (which I support), I'd like you to at least see my
reasoning, if not bring you over to the dark side.

Cheers,

Ed.

--
"I'd jump in myself, if I weren't so good at whistling."
Julian, King of Lemurs

----------
From: Michel Fodje

Every dataset costs money to produce. Is it more cost effective to expect that those wishing to use the data repeat the expenditures by repeating the experiments? To exaggerate the point, imagine a world without published research articles: would it be more expensive to do science or less? We should not simply dismiss an idea just because we think today that "640K is more memory than anyone will ever need".

----------
From: Gerard Bricogne

Dear Nat,

You are making an excellent point, that I would like to supplement with
another drawn from an intermediate stage between making compulsory the
deposition of coordinates (to which you are referring) and the discussion we
are having right now about moving towards the deposition of diffraction
images - namely, the deposition of "structure factor" data.

At first that idea seemed to many to be just as far-fetched as the
current one is seen by many. I can remember an impassioned e-mail to this BB
by Gerard Kleywegt with subject line "SOS: save our structure factors!",
pleading the case for that deposition to be made compulsory so as to make
it possible to have as objective a picture as possible of the quality of the
electron density on which the model was based; and he went on to produce the
Electron Density Server, the usefulness of which few would now dispute.
There are probably few instances in which the EDS could be proven to have
led to "significant new biological insights", but it is undeniable that it
must have provided very useful means of checking deposited structures to see
whether there might be questionable bits in crucial regions, whereas
previously one would have had to believe indiscriminately everything that
was modelled.

This structure factor deposition also led to the possibility of
large-scale testing of new developments in refinement algorithms which
played a huge role in helping improvements in those to be thoroughly
evaluated, and the programs to be made robust. This led in turn to being
able to see more detail or more corrections in old pdb entries via the EDS,
culminating in such initiatives as PDB-redo that, if not revolutionising the
biological information content of the pdb, has certainly helped make its
contents much more assessable. Through the effect on the improvement of
refinement programs, it can be said that the greatest beneficiaries of the
deposition of structure factors yesterday are not so much the people who
deposited the associated structures at the time, but everyone who refines
structures today and will do so tomorrow with the much improved programs it
has helped produce.

We are simply today at the logical next step, i.e. depositing the
images that the structure factors came from. For many reasons that have been
described by many people, images often contain much more information about
the reliability (or otherwise) of the structure factors derived from them (I
have repeatedly mentioned the corruption by reflexions from parasitic
lattices). Such images will not only provide the foodstuff for new
developments aimed at dealing better with the problem: once those
developments have taken place, more reliable data will be obtainable from
them, that may frequently clean up dubious features of the previous maps or
bring into question certain parts of the previous models. I think that
Adrian's rather dismissive comment that developers can get the job done from
a few scraps of bad images gleaned from colleagues in distress is simply a
sign of a lack of experience in developing software.

We should not, therefore, be too blinkered and ask only "What will it
do for my structure if I deposit my images", but instead ask "What will
depositing my images do to improve the processing and refinement programs of
tomorrow" (I am not trying to sound like JFK here ...). The answer is: an
awful lot! These improvements will then help everybody, including the
sceptical depositor in question in his or her next tough project; but as
usual they will be taken for granted by those who thought that depositing
images was a waste of time ... .

I hope this elicits more comments from doubters and detractors: their
voices and arguments should certainly be heard.

----------
From: Adrian Goldman

Ok. This is my last post before I go to bed. Look at the opportunity cost of this discussion alone - bright minds who should be solving structures or developing algorithms - anything! Debating this.

However - as someone else remarked, will (a) anyone care about > 90% of the structures in 50 years?

And (b) even if they do, is this continual improvement even worthwhile? I am always depressed at how little a model changes from an initial build to the final one, even when the rfree drops from 35 to 23. All that work! - and my biological interpretation would have been almost the same at the beginning as at the end.

The structures aren't important in themselves. It's the story they tell. So to me this is an effort to fix what ain't broke.

Adrian

Sent from my iPhone

----------
From: Gerard Bricogne

Dear Adrian,

I too follow Voltaire, and your point of view nicely illustrates the
diversity of outlook and priorities between practitioners of our arcane art.

I can only say that I have seen many cases where structural detail only
obtainable through hard work in phasing and/or refinement has produced
conclusions that had a significant impact on the ambient biological story,
and that this hard work would not have been possible if no one had cared
about continuing to develop ever better methods and software.

----------
From: Francis E Reyes

Thanks for bringing this up front Ed. Specifically bringing your second point to the forefront. Do we need to do it? Or to rephrase it more directly .. WHY do we need to do it?

Answering why we need to do it will really help with compliance. Lest we forget, we are asking the general crystallography community (which encompasses a large variety of interests in competition with the interest to archive the actual images) to go an additional step and provide detailed metadata (among other things). Of course you could force the community into compliance but I'm pretty sure we can motivate behavior without threats.

So I ask again, are there literature examples where reevaluation of the crystallographic data has directly resulted in new biological insights into the system being modeled?
---------------------------------------------
Francis E. Reyes M.Sc.
215 UCB
University of Colorado at Boulder

----------
From: James Stroud

This is a poor criterion on which to base any conclusions or decisions. We can blame the lack of examples on unavailability of the data.

Right now, I'd love to get my hands on the raw images for a particular cryoEM data set, but they are not available--only the maps. But the maps assume one symmetry and I have a hypothesis that the true symmetry is different. I could test my hypothesis by reprocessing the data were it available.

James

----------
From: Katherine Sippel

Generally during these rigorous bb debates I prefer to stay silent and absorb all the information possible so that I can make an informed decision later on. I fear that I am compelled to contribute in this instance. In regards to the "does this make a difference in the biological interpretation stage" issue, I can state that it does. In my comparatively minuscule career I have run into this issue three times. The first two address Adrian's point...

In one instance I adopted an orphaned structure and ran it through a slightly more advanced refinement protocol (on the same structure factors) and ended up with a completely different story than the one I started with [1]. Another researcher in my grad lab identified mis-oriented catalytic residues in an existing structure from EDS server maps which affects the biochemistry of the catalytic mechanism [2].

In another case I decided that I would reprocess some images that I had originally indexed and scaled in my "Ooo buttons clicky clicky" stage of learning crystallography and the improved structure factors revealed alternate conformations for both a critical loop and ligand orientation [3].

And this was all in the last 4 years. So I would posit that the answer is yes there are significant biological insights to be had with the capacity to reassess data in any form.

Katherine

[1] J Phys Chem Lett. 2010 Oct 7;1(19):2898-2902
[2] Acta Crystallogr D Biol Crystallogr. 2009 Mar;65(Pt 3):294-6.
[3] Manuscript in progress

------------
Katherine Sippel, PhD
Postdoctoral Associate
Baylor College of Medicine

----------
From: Petr Kolenko

Dear colleagues,

my opinion is that we should develop methods or approaches to validate
!processing! of raw data. If this is possible. We have many validation
tools for structure refinement, but no tool to validate data
processing. In case we have these tools, there is no need to deposit
diffraction images (2-5GB instead of 10 MB). I think.
Of course, how to validate this? This might be topic for a new
discussion. I am sure, that in the very beginning of crystallography,
there were no tools to validate the structures as well. I am also sure
that some opinions may arise today. (Online server, where one can
upload the images with log files from processing?)
We should concentrate more on quality of our outcome, than on storage
of these data.

Petr

----------
From: Vellieux Frederic

I'd be careful there if there was a motion to try to implement a policy at SR sources (for academic research projects) to make it compulsory to publicly release all data frames after a period (1 year ? 2 years ? 4 years) during which you are supposed to solve the structures you have collected the data for, so that others can have a go at it (and solve the structures "for you"):

you may find yourself for example in between grants and need to spend all of your time looking for funding for a couple of years, with little or no staff working with you. With the trend we see of ever diminishing resources, this would mean that the very large and well funded labs and groups would solve their own structures, and solve those of smaller groups as well (and publish the latter). This would then mean (after a while) the concentration of macromolecular crystallography to only the "lucky few" who have managed to secure large grants and will therefore go-on securing such grants. You could call that "evolution" I suppose.

We are already in a situation where the crystallographers who solved the structures are not necessarily authors on the publications reporting the structures... so is it time to go back to home sources (X-ray generators) for data collection ?

Fred.

----------
From: Gerard Bricogne

Dear Fred,

Frankly, with respect, this sounds to me like fanciful and rather
nonsensical paranoia. The time frame for public disclosure of all SR data
has been quoted at 5 years, or something of that order. If someone has been
unable to solve a structure 5 years after having collected data on it, then
it does make perfect sense that he/she be "rescued" in one way or another.
Any responsible scientist in that situation would have called for specialist
help long before then, and having failed to do so would indicate a loss of
interest in the project anyway.

This is again the type of argument that strays away from a serious
question by throwing decoys around the place. Of course such views must be
heard, but so should the counter-arguments of those who disagree with them.

----------
From: Boaz Shaanan

Hi Katherine,

It sounds as if you had all you needed to correct other people's (and your own) errors, as you described, in the existing database (EDS, PDB) or your own data, right? That hardly justifies establishing a new database of which at least 80-90% is worthless. Furthermore, since much of the non-indexable data arise from experimenter's faults, is it not the time to start a discussion (preferably prior to setting up a committee) on deposition of crystals so that anybody can have a go at them to detect problems if they wish?

Cheers,

Boaz

----------
From: Gerard Bricogne

Dear Remy,

You are right, and I was about to send a message confessing that I had
been rash in my response to Fred's. Another person e-mailed me off-list to
point out that sometimes a structure can be quickly solved, but that doing
all the rest of the work involved in wrapping that structure into a good
biological story for publication can take a very long time, and that it
would be wrong for a SR source's forced disclosure policy to start imposing
deadlines on that process. I entirely agree with both of you and admit that
I reacted too quickly and with insufficient thought to Fred's message.

However, as you point out yourself, this issue is related to a
different question (SR sources' disclosure policy towards all data collected
on their beamlines) from the original one that started this thread
(deposition of raw images with the pdb entries they led to). The two topics
became entangled through the idea of prototyping an approach to the latter
by tweaking the storage and access features involved in the former.

Many thanks to you and to the other correspondent for picking up and
correcting my error. This however leaves the main topic of this thread
untouched.

On Fri, Oct 28, 2011 at 01:38:29PM +0200, Remy Loris wrote:
> Dear Gerard,
>
> I cannot agree. Last year my group published a paper in Cell which
> contained a structure for which the native data were collected at a
> synchrotron around 1997. Various reasons contributed to the long lag period
> for solving this structure, but basically it all came down to money needed
> to do the work. Equally I am sure there are other cases for which a first
> good native data set is a breakthrough you wish to protect rather than hand
> it out to anyone who might potentially scoop you after you have put lots of
> money and effort into the project.
>
> Therefore: Images corresponding to structures I deposit in the PDB: No
> problem. That is what we do with processed data as well. But images of
> unsolved structures, I don't see why that should be enforced or done
> automatically by synchrotrons. Nobody deposits processed data without an
> accompanying structure either.
>
> I do agree that one could be given the option to deposit interesting data
> with which he/she will not continue for whatever reason. But this should be
> optional, and a clear consensus should emerge within the community as how
> the original producers of the data have to be acknowledged if these data
> are used and the results published by another team, especially if the use
> of that particular dataset is crucial for the publication.
>
> Remy Loris
> Vrije Universiteit Brussel and VIB
>
>
> On 28/10/2011 11:54, Gerard Bricogne wrote:

----------
From: Vellieux Frederic

I must "say" that there were some emails exchanged between me and Gerard later, in which I pointed out that I wasn't against deposition of images (data frames). In fact, if SR sources kept users' data there would be one more structure from here in the PDB: we had an HDD failure here, and the data were on a mirror HDD, but the company in charge of maintenance erased the data frames and data processing statistics by accident. For a trypanosomal enzyme there is no chance that I can ever get funding now to replicate the work (protein production and purification, crystallisation, data collection) so that "Table 1" could be produced for a manuscript.

However, my email to the bb was provocative - I admit I was doing this willingly - in writing that in such harsh funding times someone could start a career and get some small grant, enough to clone, produce, purify, crystallize and collect a first data set, and then find him or herself without funding for X years (success rate = less than 10% these days). If this person then gets scooped by whoever, that is the end of a promising career. Unfortunately, such a prospect doesn't seem to be science fiction any more nowadays. I hope this clears things up. I wanted to be provocative and point out the difficulties we are all facing wrt funding so that we shouldn't set up a system that may result in killing careers. Our politicians do not need any help from us on that, I think.

Fred.

----------
From: Katherine Sippel

Hi Boaz,

I see your point with regard to making a database of "all" diffraction images. The argument I clearly failed to make effectively was that improvement of structures can frequently yield useful biological information, which is why I believe that, at the least, images of deposited structures should be archived. The availability of the structure factors alone can allow crystallographers to improve models significantly, but there is always the question of whether there was more data lost to button smashing despite developers' efforts to make data processing idiot-proof.

If I am going to invest years of my life and millions of tax dollars on a hypothesis derived from a structure, I personally would be willing to take a day to reprocess the images and put the model through its paces to ensure that I'm not wasting my time and/or other people's money.

Yes, experimenters (including myself) make mistakes, but the joy of crystallography is that we can effectively backtrack, identify where the mistake was made, fix it and learn from it. All it costs us is a couple of hours and perhaps a little pride, whereas the result is stronger, more effective science for our field as a whole. In my mind that evens out the cost:benefit ratio considerably.

Of course this is all just my opinion,

Katherine

----------
From: Jacob Keller

What about a case in which two investigators have differences about
what cutoff to apply to the data, for example, A thinks that Rsym of
50 should be used regardless of I/sig, and B thinks that I/sig of 2
and Rpim should be used. Usually A would cut off the data at a lower
resolution than B, especially with high multiplicity, so B would love
to have the images to see what extra info could be gleaned from a
higher-res cutoff. Or the converse, A is skeptical of B's cutoff, and
wants to see whether the data according to A's cutoff justify B's
conclusion. Don't these types of things happen a lot, and wouldn't
images be helpful for this?
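
As a rough sketch of how the two cutoffs can diverge, assume some per-shell
statistics and apply each investigator's rule to them. The numbers below are
invented purely for illustration, the function names are hypothetical, and
Rpim is left out for brevity:

```python
# Hypothetical per-shell statistics (made-up numbers, illustration only).
# Each tuple: (high-resolution limit of the shell in Angstrom, Rsym, mean I/sig)
SHELLS = [
    (3.00, 0.08, 25.0),
    (2.50, 0.15, 14.0),
    (2.20, 0.30, 7.5),
    (2.00, 0.48, 4.2),
    (1.90, 0.65, 3.0),
    (1.80, 0.90, 2.1),
    (1.70, 1.40, 1.2),
]


def cutoff(shells, accept):
    """Walk from low to high resolution and return the limit of the last
    shell accepted before the first rejected one (None if none pass)."""
    limit = None
    for d_min, rsym, i_over_sig in shells:
        if not accept(rsym, i_over_sig):
            break
        limit = d_min
    return limit


def criterion_a(rsym, i_over_sig):
    # Investigator A: keep shells with Rsym up to 50%, ignoring I/sig.
    return rsym <= 0.50


def criterion_b(rsym, i_over_sig):
    # Investigator B: keep shells with mean I/sig of at least 2.
    return i_over_sig >= 2.0


cutoff_a = cutoff(SHELLS, criterion_a)
cutoff_b = cutoff(SHELLS, criterion_b)
print(f"A cuts at {cutoff_a:.2f} A, B cuts at {cutoff_b:.2f} A")
```

With these assumed numbers A stops at 2.00 A and B at 1.80 A, so B would want
the images to see what the extra shells contain.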

JPK

----------
From: Jürgen Bosch

Which trypanosomal protein? Check the MSGPP or SGPP website; they might have what you are looking for.

Jürgen

----------
From: Gerard Bricogne

Dear Jacob,

See the paper by J. Wang cited at the end of Francis Reyes's message
under this thread yesterday: it is a case of exactly what you are talking
about.

----------
From: Boaz Shaanan

Hi Jacob,

There is a (very) BIG difference between depositing images for deposited structures and depositing all images ever recorded by any crystallographer on the planet. In the case you presented, A and B can settle the issue by looking at each other's images, whether through the database or by exchanging data on their own initiative, or even by writing a note to a journal that they completely disagree with one another and starting a debate (in case one of them is not willing to exchange images). Besides, I thought that by now there are some standards on how data should be processed (this has been discussed on this BB once every few months, if I'm not mistaken). Isn't that part of the validation process that so many good people have established? Also, to the best of my knowledge (and experience) referees (at least of some journals) are instructed to look into those issues these days and comment about them, aren't they?

----------
From: Jacob Keller

I don't think anyone is considering archiving all images *as a first
step*! I think the obvious first step is to try to get depositors of
new structures to the PDB to deposit images at the same time, which to
me seems trivially easy and has a reasonably high benefit:cost ratio.
Let's just do it, maybe even making it optional at first. I don't even
think that in the beginning a common image format would be
required--most programs can use multiple formats. Anyway, that could
be addressed down the line.

Jacob

----------
From: Katherine Sippel

I have said my piece on the issue of depositing, but there is one comment I would like to address.

There is a big difference between data that is processed correctly and data that is processed well. I'm reminded of a Yogi Berra quote "In theory, theory and practice are the same. In practice, they are not." Every year our grad level crystallography course would take the same lysozyme data set and break into groups of two to process them independently in HKL2000 and every year we'd get a vastly different array of statistics. All of the processing would produce valid structure factors that were essentially the same and all would pass SFCHECK. The difference in the numbers varied for many reasons including the choice of reference image, initial spot number picking, profile fitting radius, spot size, the use of 3D profile fitting, choice of scaling factors and whether appropriate sacrifices had been made to the Denzo gods. And though the overall backbone and structure remained mostly the same there were clearly some who had better maps than the others.

So yes there is a standard protocol in place and it can identify and correct gross error but by no means does that indicate the data was processed well.

Sincerely,
Katherine

----------
From: Francis E Reyes

Agreed. Reprocessing the data resulting in a different biological result is my personal reason and motivates me to support raw data deposition. However, I'm not the one that needs convincing; it's the other PIs/graduate students/postdocs (who may see MX as a simple tool to get an answer for some larger question) who need to see the value of depositing raw MX images.

The point of my email is to elicit other people's reasons why raw data deposition is necessary,
and thank you for providing yours.

F

----------
From: Ethan Merritt

If this is true, I must not have got the memo!

I hear differences of opinion among senior crystallographers, even just
considering discussions at our local research meetings, let alone in the
context of world-wide practice.

- Where to set a resolution cutoff?
- Use or not use a criterion on Rmerge (or Rpim or maximum scale factor or
completeness in shell)?
- Use all images in a run, or limit them to some maximal amount of decay?
- Empirical absorption correction during scaling?
- XDS? HKL? mosflm?

As to what reviewers have access to, at best one sees a "Table 1" with
summary statistics. But rarely if ever do we see the protocol or
decisions that went into the processing that yielded those statistics.

And more to the point of the current issue, a reviewer without access
to the original diffraction images cannot possibly comment on
- Were there unexplained spots that might have indicated a supercell
or other strangeness with the lattice?
- Evidence of non-merohedral twinning in the diffraction pattern?
- Was the integration box size chosen appropriately?
- Did the diffraction data clearly extend beyond the resolution limit
chosen by the authors?

I hasten to add that I am not advocating for a requirement that the
diffraction images be sent to reviewers of a manuscript!
But these are all examples of points where current opinion differs,
and standard practice in the future may differ even more.
If the images are saved, then the quality of the data extracted from
them may improve using those not-yet-developed programs and protocols.

So there is, to me, clearly some value in saving them.
How to balance that value against the cost? - that's another question.

----------
From: Robbie Joosten

Hi Francis,

Even though they are not published, there are enough models in the PDB for
which reevaluation of the crystallographic data leads to new biological
insight. Unfortunately, a lot of the insight is of the type "that ligand
doesn't really bind, or at least not in that pose". Another nice one is a
sequencing error in a Uniprot entry that became obvious after critically
looking at the structure and the maps (the authors, of both structure and
sequence, acknowledge the problem, but the entry is not yet fixed, so no
names). Yesterday, I had a case where I didn't so much mistrust the model,
but I would still have liked to have access to the images. There was
something weird in the maps that was also clearly there in pictures of the
maps in the linked publication, but it was not discussed.

Needless to say, I'm in favour of depositing images. At least for published
structure models. There are still a lot of interesting things to find in
current and future PDB entries.

Cheers,
Robbie
> James Stroud
> Sent: Friday, October 28, 2011 07:57

CCP4 Bulletin Board Archive

Saturday, 19 November 2011

raw data deposition
