Saturday, 26 November 2011

Archiving Images for PDB Depositions

From: Jacob Keller
Date: 31 October 2011 16:02


Dear Crystallographers,

I am sending this to try to start a thread which addresses only the
specific issue of whether to archive, at least as a start, images
corresponding to PDB-deposited structures. I believe there could be a
real consensus about the low cost and usefulness of this degree of
archiving, but the discussion keeps swinging around to all levels of
archiving, obfuscating who's for what and for what reason. What about
this level, alone? All of the accompanying info is already entered
into the PDB, so there would be no additional costs on that score.
There could just be a simple link, added to the "download files"
pulldown, which could say "go to image archive," or something along
those lines. Images would be pre-zipped, maybe even tarred, and people
could just download from there. What's so bad?

The benefit is that sometimes there are structures in which the
resolution cutoff may be unreasonable, or there is potential radiation
damage in the later frames that could compromise interpretation, or
there are ugly features in the images which are invisible or obscured
in the statistics.

In any case, it seems to me that this step would be pretty painless,
as it is merely an extension of the current system--just add a link to
the pulldown menu!

Best Regards,

Jacob Keller

--
*******************************************
Jacob Pearson Keller
Northwestern University
Medical Scientist Training Program
*******************************************

----------
From: Adrian Goldman


I have no problem with this idea as an opt-in. However, I loathe being forced to do things - for my own good or anyone else's. But unless I have read the tenor of this discussion completely wrongly, opt-in is precisely what is not being proposed.

Adrian Goldman


----------
From: Jacob Keller


Pilot phase, opt-in--eventually, mandatory? Like structure factors?

Jacob

----------
From: Frank von Delft


"Loathe being forced to do things"?  You mean, like being forced to use programs developed by others at no cost to yourself?

I'm in a bit of a time-warp here - how exactly do users think our current suite of software got to be as astonishingly good as it is?  10 years ago people (non-developers) were saying exactly the same things - yet almost every talk on phasing and auto-building that I've heard ends up acknowledging the JCSG datasets.

Must have been a waste of time then, I suppose.

phx.

----------
From: Clemens Vonrhein


Dear Adrian,
I understood it slightly differently - see Gerard Bricogne's points in

 https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1110&L=CCP4BB&F=&S=&P=363135

which sound very much like an opt-in? Such a starting point sounds
very similar to what we had with initial PDB submission (optional for
publication) and then structure factor deposition.

Cheers

Clemens

--

***************************************************************
* Clemens Vonrhein, Ph.D.     vonrhein AT GlobalPhasing DOT com
*
*  Global Phasing Ltd.
*  Sheraton House, Castle Park
*  Cambridge CB3 0AX, UK
*--------------------------------------------------------------
* BUSTER Development Group      (http://www.globalphasing.com)
***************************************************************

----------
From: Anastassis Perrakis

Dear all,

Apologies for a lengthy email in a lengthy chain of emails.

I think Jacob did a good job here of refocusing the question. I will try to answer it in a rather simplistic manner,
but from the viewpoint of somebody who may have spent only relatively little time in the field, but has enjoyed the
privilege of seeing it from both the developer and the user perspective, and from environments such as
service-oriented synchrotron sites as well as a cancer hospital. I will only claim my weight=1, obviously,
but I want to emphasize that where you stand influences your perspective.

Let me first present the background that shapes my views.

<you can skip this>

When we started with ARP/wARP (two decades for Victor and getting pretty close for myself!), we (like others) hardly
had the benefit of large datasets. We had some friends who gladly donated their data for us to play with,
and we assembled enough data to aid our primitive efforts back then. The same holds true for many others.

At some point, around 2002, we started XtalDepot with Serge Cohen: the idea was to systematically collect phased data,
moving one step beyond HKL F/SigF to include either HLA/B/C/D or the search model for the molecular replacement solution.
Despite several calls, that archive acquired only around a hundred structures, and yesterday morning it was taken off-line,
as it was no longer useful and no longer visited by anyone. Very likely our effort was made redundant by the JCSG
dataset, which has been used by many, many people who are grateful for it (I guess Frank's 'almost every talk' refers to me;
I have never used the JCSG set).

Lately, I am involved in the PDB_REDO project, which was pioneered by Gert Vriend and Robbie Joosten (who is now in my lab).
Thanks to Gerard K.'s EDS clean-up and the subsequent efforts of both Robbie and Garib, who made gazillions of fixes to refmac,
we can now not only make maps of PDB entries but also refine them - all but fewer than 100 structures. That has cost a significant part of
the last four or five years of Robbie's life (and has received limited appreciation from editors of 'important' journals and from referees of our grants).

</you can skip this>

These experiences are what shape my view, and my train of thought goes like this:

The PDB collected F/sigF, and the ability to really use them - to get maps at first, to re-refine later, and to re-build now - has received rather
limited attention. It is starting to have an impact on some fields, mostly on modeling efforts, and unlike referee nr. 3 I strongly believe it
has great potential for impact.

My team also collected phases, as did the JCSG on a more successful and consistent scale,
and that effort has indeed been used by developers to deliver better benchmarking
of many programs (it has escaped my attention if anyone has used the JCSG data directly, e.g. for learning techniques,
but I apologize if I have missed that). This benchmarking of software, based on 'real' maps for a rather limited set of data -
hundreds, not tens of thousands - was important enough anyway.

That leads me to conclude that archiving images is a good idea on a voluntary basis. Somebody who needs it should convince the funding bodies
to make the money available, and then make the effort to provide the infrastructure. I would predict that 100-200 datasets would then be collected,
and that would really, really help developers to make the important new algorithms and software we all need. That's a modest investment
that can teach us a lot. One of the SG groups could make this effort, and most of us would support it, myself included.

Would such data help more than the developers? I doubt it. Is it important to make such a resource available to developers? Absolutely?
What is the size of the resource needed? Limited to a few hundred datasets, which can be curated and stored on a modest budget.

Talking about archiving on a PDB scale might be fantastic in principle, but it would require time and resources on a scale that would not clearly stand up to a
cost-benefit trial, especially in times of austerity.

In contrast, a systematic effort by our community to deposit DNA in existing databanks like AddGene.com, and to annotate PDB entries with such deposition
numbers, would be cheap, efficient, and could have far-reaching implications for the many people who could then easily get the DNA to start studying
structures in the database. That would surely lead to new science, because people interested enough in these structures to claim the DNA and
'redo' the project would add new science. One can even imagine SG centers offering such a service ('please redo structure X for this and that reason')
for a fee representing the real costs, which must be low given the experience and technology already invested there - a subset
of targets could be on a 'request' basis...

Sorry for getting wild ... we can of course now have a referendum to decide on the best curse of action! :-(

A.

PS Rob, you are of course right about sequencing costs, but I was only trying to paint the bigger picture...



Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
Department of Biochemistry (B8)
Netherlands Cancer Institute, 
Dept. B8, 1066 CX Amsterdam, The Netherlands




----------
From: Anastassis Perrakis


To avoid misunderstandings, since I received a couple of emails already:

? was a typo. I meant Absolutely!
 I think such data are essential for development of better processing software, and I find the development of better
processing software of paramount importance!
Curse was not a typo.
I am Greek. Today, thinking of referendums, I see many curses of action, and limited courses of action.

A.

----------
From: George M. Sheldrick


Speaking as a part-time methods developer, I agree with Tassos that a couple
of hundred suitably chosen and documented datasets would be adequate for most
purposes. I find that it is always revealing to be able to compare a new
algorithm with existing attempts to solve the same problem, and this is much
easier if we use the same data when reporting such tests. Since I am most
interested in phasing, all I need are unmerged reflection datasets and a PDB
file of the final model. It would be a relatively small extension of the
current deposition requirements to ask depositors to provide unmerged
intensities and sigI for the data collected for phasing as well as for the
final refinement. This would also provide useful additional information for
validation (even where experimental phasing failed and the structure was
solved by MR).

George
--
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,

----------
From: Gerard Bricogne


Dear Tassos,

    If you apologise for a long e-mail in a long chain of them, I don't
know with what oratory precautions I should preface mine ... . I will
instead skip the disclaimers and try to remain brief.

    It seems to me that there is a slight paradox, or inconsistency, in
your position. You concur with the view, expressed by many and just now
supported by George, that developers could perfectly well do their job on
the basis of relatively small collections of test datasets that they could
assemble through their own connections or initiative. I mostly agree with
this. So the improvements will take place, perhaps not to the same final
degree of robustness, but to a useful degree nevertheless. What would be
lost, however, is the possibility of reanalysing all other raw image
datasets to get the benefit of those new developments on the data associated
with the large number of other PDB entries for which images would have been
deposited if a scheme such as the one proposed had been put in place, but
will not have been otherwise.

    Well and good, many have said, but who would do that anyway? And what
benefit would it bring? I understand the position of these sceptics, but I
do not see how you, dear Tassos, of all people, can be of this opinion, when
you have just in the previous sentence sung the praises of Gert and Robbie
and PDB-REDO, as well as expressed regret that this effort remains greatly
underappreciated. If at the time of discussing the deposition of structure
factor data people had used your argument (that it is enough for developers
to gather their own portfolio of test sets of such data from their friends
and collaborators) we could perhaps have witnessed comparable improvements
in refinement software, but there would have been no PDB-REDO because the
data for running it would simply not have been available! ;-) . Or do you
think the parallel does not apply?

    One of the most surprising aspects of this overall discussion has been
the low value that seems to be given, in so many people's opinion, to that
possibility of being able to find improved results for pdb entries one would
return to after a while - as would be the case of a bottle of good wine that
would have matured in the cellar. OK, it can be a bit annoying to have to
accept that anyone could improve one's *own* results; but being able to find
better versions of a lot of other people's structures would have had, I would
have thought, some value. From the perspective of your message, then, why
are the benefits of PDB-REDO so unique that PDB-REPROCESS would have no
chance of measuring up to them?


    That is the best I managed to do to keep this reply brief :-) .


    With best wishes,

         Gerard.

--

    ===============================================================
    *                                                             *
    * Gerard Bricogne                     gb10@GlobalPhasing.com  *
    *                                                             *
    * Sheraton House, Castle Park                                 *
    * Cambridge CB3 0AX, UK                                       *
    *                                                             *
    ===============================================================

----------
From: Edward A. Berry

Gerard Bricogne wrote:

. . . . the view, expressed by many and just now

Well, let's put it to the test:
let one developer advertise on this board a request for the type
of dataset (s)he would like to have as a test case for a current project.
The assumption is that it is out there. See whether people
recognize their data as fitting the request and voluntarily supply
it, or whether we need this effort to make all data available and (what would
be more burdensome) annotate it sufficiently that the same developer,
looking for a particular pathology, would be able to find it among
the petabytes of other data.

I seem to remember two or three times in the past 18 years when
such requests were made (and there is the standing request to make
data submitted to the ARP/wARP server available to the developers),
and I assumed the developers were getting what they wanted.
Maybe not - maybe they found that no one responds to such requests, so they
stopped making them.
Ed

----------
From: Anastassis Perrakis


Dear Gerard

Isolating your main points: ... I was thinking of the inconsistency while sending my previous email ... ;-)

Basically, the parallel does apply. PDB-REPROCESS in a few years would
be really fantastic - speaking as a crystallographer and methods developer.

Speaking as a structural biologist, though, I did think long and hard about
the usefulness of PDB_REDO. I obviously decided it's useful, since I am now
heavily involved in it, for a few reasons: uniformity of final model treatment,
improving refinement software, better statistics on structure quality metrics,
and of course seeing whether the new models will change our understanding of
the biology of the system.

An experiment that I would like to do as a structural biologist is the following:
what about adding an "increasing noise" model to the Fobs of a few datasets and re-refining?
How much would that noise change the final model quality metrics, and in absolute terms?

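The perturbation step of such an experiment might look something like the sketch below - a minimal sketch assuming gemmi's Python MTZ bindings and hypothetical column labels FP/SIGFP; the re-refinement itself (e.g. with refmac) would still be run separately on each perturbed file.

# Minimal sketch of the noise-injection step, assuming gemmi's Python MTZ
# bindings and hypothetical column labels FP/SIGFP; re-refinement (e.g. with
# refmac) would be run separately on each perturbed file.
import numpy as np
import gemmi

def perturb_fobs(mtz_in, mtz_out, scale, seed=0):
    """Copy mtz_in to mtz_out with extra noise ~ N(0, scale*SIGFP) added to FP."""
    mtz = gemmi.read_mtz_file(mtz_in)
    labels = mtz.column_labels()
    data = np.array(mtz.array, copy=True)            # all columns as one 2D array
    i_f, i_sig = labels.index('FP'), labels.index('SIGFP')
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, scale * data[:, i_sig])
    data[:, i_f] = np.clip(data[:, i_f] + noise, 0.0, None)   # keep amplitudes >= 0
    mtz.set_data(data.astype(np.float32))
    mtz.write_to_file(mtz_out)

# A noise series for re-refinement: sigma multipliers 0.5x, 1x, 2x, 4x.
for k in (0.5, 1.0, 2.0, 4.0):
    perturb_fobs('input.mtz', 'noise_%.1fx.mtz' % k, scale=k)
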
(For the changes that PDB_RE(BUILD) makes, have a preview at http://www.ncbi.nlm.nih.gov/pubmed/22034521
... I tried to avoid the shameless self-promoting plug, but could not resist in the end!)

That experiment - or a better-designed variant of it - might tell us whether we should be advocating the archiving of all images;
being scientifically convinced of its importance beyond methods development, we would all argue a strong case
to the funding and hosting agencies.

Tassos

PS Of course, that does not negate the all-important argument that, when struggling with marginal
data, better processing software is essential. There is a clear need for better software
to process images, especially for low-resolution and low signal-to-noise cases.
Since that depends on having test data, I am all for supporting an initiative to collect such data,
and I would gladly spend a day digging through our archives to contribute.

----------
From: James Holton

On general scientific principles the reasons for archiving "raw data" all boil down to one thing: there was a systematic error, and you hope to one day account for it.  After all, a "systematic error" is just something you haven't modeled yet.  Is it worth modelling?  That depends...

There are two main kinds of systematic error in MX:
1) Fobs vs Fcalc
   Given that the reproducibility of Fobs is typically < 3%, but typical R/Rfree values are in the 20%s, it is safe to say that this is a rather whopping systematic error.  What causes it?  Dunno.  Would structural biologists benefit from being able to model it?  Oh yes!  Imagine being able to reliably see a ligand that has an occupancy of only 0.05, or to be able to unambiguously distinguish between two proposed reaction mechanisms and back up your claims with hard-core statistics (derived from SIGF).  Perhaps even teasing apart all the different minor conformers occupied by the molecule in its functional cycle?  I think this is the main reason why we all decided to archive Fobs: 20% error is a lot.

2) scale factors
   We throw a lot of things into "scale factors", including sample absorption, shutter timing errors, radiation damage, flicker in the incident beam, vibrating crystals, phosphor thickness, point-spread variations, and many other phenomena.  Do we understand the physics behind them?  Yes (mostly).  Is there "new biology" to be had by modelling them more accurately?  No.  Unless, of course, you count all the structures we have not solved yet.

Wouldn't it be nice if phasing from sulfur, phosphorus, chloride and other "native" elements actually worked?  You wouldn't have to grow SeMet protein anymore, and you could go after systems that don't express well in E. coli.  Perhaps even going to the native source!  I think there is plenty of "new biology" to be had there.  Wouldn't it be nice if you could do S-SAD even though your spots were all smeary, overlapped, mosaic and radiation-damaged?

 Why don't we do this now?  Simple: it doesn't work.  Why doesn't it work?  Because we don't know all the "scale factors" accurately enough.  In most cases, the "% error" from all the scale factors adds up to ~3% (aka Rmerge, Rpim, etc.), but the change in spot intensities due to native-element anomalous scattering is usually less than 1%.  Currently, the world record for the smallest Bijvoet ratio is ~0.5% (Wang et al. 2006), but if photon counting were the only source of error, we should be able to get an Rmerge of ~0.1% or less, particularly in the low-angle resolution bins.  If we can do that, then there will be little need for SeMet anymore.
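
To put rough numbers on this error budget, a back-of-envelope comparison might look like the sketch below; the Bijvoet-ratio formula is the usual Crick-Magdoff-style approximation, and the example protein (10 sulfurs in ~1000 atoms at Cu K-alpha) is an illustrative assumption, not a figure from the thread.

# Back-of-envelope: photon-counting limit on Rmerge vs. expected S-SAD signal.
# The formulas are standard approximations; the numbers are illustrative only.
from math import sqrt, pi

def poisson_rmerge(photons_per_obs):
    # For pure counting statistics, sigma(I)/I = 1/sqrt(N); Rmerge measures the
    # mean absolute deviation of single observations, ~ sqrt(2/pi) * sigma/I.
    return sqrt(2.0 / pi) / sqrt(photons_per_obs)

def bijvoet_ratio(n_anom, n_atoms, f_double_prime, z_eff=6.7):
    # Crick-Magdoff-style estimate: <|dF+-|>/<F> ~ sqrt(2*Na/Nt) * f''/Zeff.
    return sqrt(2.0 * n_anom / n_atoms) * f_double_prime / z_eff

# Hypothetical example: ~1000-atom protein, 10 sulfurs, f''(S) ~ 0.56 e at Cu K-alpha.
signal = bijvoet_ratio(n_anom=10, n_atoms=1000, f_double_prime=0.56)
for n in (1e4, 1e6):
    print("%.0e photons/spot: Poisson-limited Rmerge ~ %.4f" % (n, poisson_rmerge(n)))
print("expected Bijvoet ratio ~ %.4f" % signal)
# A ~3% observed Rmerge swamps a ~1% anomalous signal, but the counting limit
# (~0.1% for strong spots) would leave plenty of room for S-SAD.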

But, we need the "raw" images if we are to have any hope of figuring out how to get the errors down to the 0.1% level.  There is no one magic dataset that will tell us how to do this; we need to "average over" lots of them.  Yes, this is further "upstream" of the "new biology" than deposited Fs, and yes, the cost of archiving images is higher, but I think the potential benefits to the structural biology community, if we can crack the 0.1% S-SAD barrier, are nothing short of revolutionary.

-James Holton
MAD Scientist

----------
From: Graeme Winter

Hi Ed,

Ok, I'll bite: I would be very interested to see any datasets which
were initially thought to be e.g. PG222 and scale OK-ish with that but
turn out in hindsight to be, say, PG2. Trying to spot this automatically,
or at least warn about it, inside xia2 would be really handy. Any
pseudosymmetric examples would be interesting.

Also any which are pseudocentred - they index OK in C2 (say) but should
really be P2 (with the same cell), as the "missing" reflections are in
fact present but just rather weaker due to NCS.

I have one example of each from the JCSG but more would be great,
especially in cases where the structure was solved & deposited.

There we go.

Now the matter of actually getting these here is slightly harder but
if anyone has an example I will work something out. Please get in
touch off-list... I will respond to the BB in a week or so to feed
back on how responses to this go :o)

Best wishes,

Graeme

----------
From: Loes Kroon-Batenburg


The problem is that, in practice, errors in the data arise from systematic errors, as James Holton nicely sums up. It is hard to 'model' these. We really need many test cases to improve data-processing techniques. An interesting suggestion mentioned along the thread was to create a database of problematic data, i.e. data containing unexplained peaks, diffuse streaks, strange reflection profiles, etc.
Needless to say, developers (and future PDB-REPROCESS initiatives) will benefit from raw data deposition. We will have to establish whether it is worth the money and effort to deposit ALL raw data.

Loes.

--
__________________________________________

Dr. Loes Kroon-Batenburg
Dept. of Crystal and Structural Chemistry
Bijvoet Center for Biomolecular Research
Utrecht University
Padualaan 8, 3584 CH Utrecht
The Netherlands
__________________________________________

----------
From: James Holton


I tried looking for such "evil symmetry problem" examples some time ago, only to find that primitive monoclinic with a 90-degree beta angle is much rarer than one might think from looking at the PDB.  About 1/3 of such entries are in the wrong space group.

Indeed, there are at least 366 PDB entries that claim "P2-ish", but POINTLESS thinks the space group of the deposited data is higher (PG222, C2, P6, etc.).  Now, POINTLESS can be fooled by twinned data, but at least 286 of these entries do not mention twinning.  Of these, 40 explicitly list NCS operators (not sure if the others used NCS?), and 35 of those were both solved by molecular replacement and explicitly say the free-R set was picked at random.  These are:

Now, I'm sure there is an explanation for each and every one of these.  But in the hands of a novice, such cases could easily result in a completely wrong structure giving a perfectly reasonable Rfree.  This would happen if you started with, say, a wrong MR solution, but picked your random Rfree set in PG2 and then applied "NCS".  Then each of your "free" hkls would actually be NCS-restrained to be the same as a member of the working set.  However, I'm sure everyone who reads the CCP4BB already knew that.  Perhaps because a discerning peer-reviewer, PDB annotator or some clever feature in our modern bullet-proof crystallographic software caught such a mistake for them. (Ahem)

Of course, what Graeme is asking for is the opposite of this: data that would appear as "nearly" PG222, but was actually lower symmetry.  Unfortunately, there is no way to identify such cases from deposited Fs alone, as they will have been overmerged.  In fact, I did once see a talk where someone managed to hammer an NCS 7-fold into a crystallographic 2-fold by doing some aggressive "outlier rejection" in scaling.  Can't remember if that ever got published...
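
A survey of this kind could be scripted roughly as sketched below; the gemmi cif2mtz conversion step and the exact wording of the POINTLESS log line being parsed are assumptions that may need adjusting.

# Rough sketch of a "does POINTLESS agree with the deposited space group?" check.
# Assumes the 'gemmi' and 'pointless' executables are on PATH; the log parsing
# is an assumption about the POINTLESS output wording and may need adjusting.
import re
import subprocess
import sys

def pointless_best_group(sf_cif, mtz='sf.mtz'):
    # Convert deposited structure factors (mmCIF) to MTZ, then let POINTLESS
    # score the possible Laue/space groups.
    subprocess.run(['gemmi', 'cif2mtz', sf_cif, mtz], check=True)
    result = subprocess.run(['pointless', 'HKLIN', mtz], input='',
                            capture_output=True, text=True, check=True)
    m = re.search(r'Best Solution:\s*space group\s+(.+)', result.stdout)
    return m.group(1).strip() if m else None

if __name__ == '__main__':
    # usage: python check_sg.py r1abcsf.ent.cif "P 1 21 1"
    best = pointless_best_group(sys.argv[1])
    print('deposited: %s   POINTLESS suggests: %s' % (sys.argv[2], best))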

-James Holton
MAD Scientist

----------
From: Clemens Vonrhein

Hi James,

scary ... I was just looking at exactly the same thing (P21 with
beta~90), using the same tool (POINTLESS).

Currently I'm going through the structures for which images can be
found ... I haven't got far through that list yet (in fact, only the
first one so far), but this first case should indeed be in a higher
space group (P 2 21 21).

As you say (and that's what Graeme is looking for): finding 'over-merged'
datasets can be a bit trickier ... once the damage is done. I have a
hunch that it might happen even more often, though: we tend to
look for the highest symmetry that still gives a good indexing score,
right?  Otherwise we would all go for P1 ...

Some other interesting groups for under-merging:

 * orthorhombic with a==b or a==c or b==c (maybe tetragonal?)

 * trigonal (P 3 etc) when it should be P 6

 * monoclinic with beta==120

A few cases of each of those too ... all easy to check in
ftp://ftp.wwpdb.org/pub/pdb/derived_data/index/crystal.idx and then
(if structure factors are deposited) by running POINTLESS on them (great
program, Phil!).
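
That crystal.idx screen is easy to script; the sketch below assumes each line carries the PDB code, the six cell parameters and the space group in roughly semicolon-separated fields - the parsing should be checked against the actual file layout.

# Sketch: flag primitive monoclinic entries whose beta angle is suspiciously
# close to 90 degrees. The parsing assumes a 'CODE ; a b c alpha beta gamma ;
# SPACEGROUP' style layout in crystal.idx - check against the real file.
def suspicious_monoclinic(path, tol=0.3):
    hits = []
    with open(path) as fh:
        for line in fh:
            parts = [p.strip() for p in line.split(';')]
            if len(parts) < 3:
                continue                                  # header or malformed line
            code, cell, sg = parts[0], parts[1].split(), parts[-1]
            if len(cell) != 6:
                continue
            if not sg.replace(' ', '').upper().startswith('P12'):
                continue                                  # keep P2 / P21 only
            try:
                beta = float(cell[4])
            except ValueError:
                continue
            if abs(beta - 90.0) < tol:
                hits.append((code, beta, sg))
    return hits

for code, beta, sg in suspicious_monoclinic('crystal.idx'):
    print(code, beta, sg)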

Cheers

Clemens

----------
From: Bryan Lepore


not sure I follow this thread, but this table might be interesting :

http://journals.iucr.org/d/issues/2010/05/00/dz5193/dz5193sup1.pdf

from:

Detection and correction of underassigned rotational symmetry prior to
structure deposition
B. K. Poon, R. W. Grosse-Kunstleve, P. H. Zwart and N. K. Sauter
Acta Cryst. (2010). D66, 503-513    [ doi:10.1107/S0907444910001502 ]

----------
From: Clemens Vonrhein

Oh yes, that is relevant and very interesting. As far as I understand
it, though, the detection of higher symmetry is based on the atomic
coordinates and not on the structure factors (please correct me if I'm
wrong here).

At least some of the cases for which the deposited structure factors
strongly suggest a higher symmetry don't seem to be detected using
that paper's approach (I can't find them listed in the supplementary material).

Cheers

Clemens

----------
From: Felix Frolow


God bless the symmetry: we are saved from over-interpreting symmetry (except probably in very exotic cases) by the very high Rsym factors, around 40-50%, when the symmetry is wrong.
Even wild rejection of outliers cannot restore an "acceptable" Rmerge.
In my personal repository, 1QZV is a manifestation of that. At 4.4 angstrom resolution, a wrong interpretation of a 90.2-degree monoclinic angle as 90 degrees orthorhombic, supported by two molecules in the monoclinic asymmetric unit, was corrected in the middle of the first data collection. Habitual on-the-fly processing of the data (integration and repetitive scaling after every several frames with HKL) detected that, about halfway through the data, the R factor in the orthorhombic space group jumped from about 7% to 40%.
Reindexing solved the problem on the spot. I still keep the raw data.
Needless to say, a decade or so ago we would have taken precession photographs (I still own a precession camera) and would not
have made such a mistake.
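
A minimal sketch of that kind of on-the-fly check - the cumulative Rmerge recomputed as frames are added, with a warning on a sudden jump - assuming the observations have already been mapped to the asymmetric unit of the trial point group:

# Minimal sketch of an on-the-fly symmetry sanity check: recompute the
# cumulative Rmerge as frames are added and warn if it jumps. Assumes the
# observations are already mapped to the asymmetric unit of the trial point
# group (arrays of hkl triplets, intensities and frame numbers).
import numpy as np
from collections import defaultdict

def rmerge(hkls, intensities):
    groups = defaultdict(list)
    for hkl, i in zip(map(tuple, hkls), intensities):
        groups[hkl].append(i)
    num = den = 0.0
    for obs in groups.values():
        obs = np.asarray(obs)
        num += np.abs(obs - obs.mean()).sum()
        den += obs.sum()
    return num / den if den else 0.0

def monitor(hkls, intensities, frames, step=50, jump=3.0):
    """Print cumulative Rmerge every `step` frames; flag a sudden increase."""
    hkls, intensities, frames = map(np.asarray, (hkls, intensities, frames))
    previous = None
    for limit in range(step, int(frames.max()) + step, step):
        sel = frames <= limit
        r = rmerge(hkls[sel], intensities[sel])
        warn = '  <-- check the space-group assignment!' if previous and r > jump * previous else ''
        print('frames <= %4d: Rmerge = %.3f%s' % (limit, r, warn))
        previous = r
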
Dr Felix Frolow   
Professor of Structural Biology and Biotechnology
Department of Molecular Microbiology
and Biotechnology
Tel Aviv University 69978, Israel

Acta Crystallographica F, co-editor




----------
From: Felix Frolow


Clemens,
In the past we used TRACER (public domain) to check for higher symmetry, or we interpreted Niggli values manually :-)
TRACER has been gone a long time and Niggli values are not displayed anymore, so we trust the autoindexing of DENZO which, assuming all experimental parameters are properly set (we do this by using a standard crystal such as lysozyme), is extremely sensitive in identifying the Bravais system. I have no experience with POINTLESS, but assume that it also does an excellent job.


FF

----------
From: mjvdwoerd

Reluctantly I am going to add my 2 cents to the discussion, with various aspects in one e-mail.

- It is easy to overlook that our "business" is to answer biological/biochemical questions. This is what you (generally) get grants to do (by showing that these questions are of critical importance and that you are able to do the science). Crystallography is one tool that we use to acquire evidence to answer questions. The time when you could get a Nobel prize for a structure, or a PhD for a structure, is gone. Even writing a publication with just a structure is not as common anymore as it used to be. So the "biochemistry" drives the crystallography. It is not reasonable to say that if you have collected data and not published it for 5 years, you are no longer interested. What that generally means is that "the rest of science" is not cooperating. In short: I would be against a strict rule for mandatory deposition of raw data, even after a long time. An example: I have datasets here with low-resolution data (~10A), presumably of proteins whose structures are known for prokaryotes but not for eukaryotes, and it would be exciting if we could prove (or disprove) that they look the same. The problem, apart from resolution, is that the spots are so few and fuzzy that I cannot index the images. The main reason I save the images is that if/when someone comes to me to say that they think they have made better crystals, we have something to compare against. (Thanks to Gerard B. for encouragement to write this item :-)

- For those who think that we have come to the end of development in crystallography, James Holton (thank you) has described nicely why we should not think this. We are all happy if our model generates an R-factor of 20%. Even small-molecule crystallographers would wave that away in an instant as inadequate. However, "everybody" has come to accept that this is fine for protein crystallography. It would be better if our models were more consistent with the experimental data. How could we make such models without access to lots of data? As a student I was always taught (when asking why 20% is actually "good") that we don't (for example) model solvent. Why not? It is not easy. If we did, would the 20% go down to 3%? I am guessing not; there are other errors that come into play.

- Gerard K. has eloquently spoken about cost and effort. Since I maintain a small (local) archive of images, I can affirm his words: a large-capacity disk is inexpensive ($100). A box that the disk sits in is inexpensive ($1000). A second box, which sits in a different building (away for security reasons) and holds the backup, is inexpensive ($1400, with 4 disks). The infrastructure to run these boxes (power, fiber optics, boxes in between) is slightly more expensive. What is *really* expensive is the people maintaining everything. It was a huge surprise to me (and my boss) how much time and effort it takes to annotate all datasets, rename them appropriately and file them away in a logical place so that anyone (who understands the scheme) can find them again. Therefore (!) the reason why this should be centralized is that the cost per dataset stored goes down - it is more efficient. One person can process several (many, if largely automated) datasets per day. It is also of interest that we locally (2-5 people on a project) may not agree on exactly what should be stored. Therefore there is no hope that we can find consensus in the world, but we CAN get a reasonable compromise. But it is tough: I have heard the argument that data for
published structures should be kept in case someone wants to look at or go back to them, while I have also heard the argument that once published the work is signed, sealed and delivered and the data can go, whereas UNpublished data should be preserved because eventually they will, hopefully, get to publication. Each argument is reasonably sensible, but the conclusions are opposite. (I maintain both classes of datasets.)

- Granting agencies in the US generally require that you archive scientific data. What is not yet clear is whether they would be willing to pay for a centralized facility that would do that. After all, it is more exciting for the NIH to give money for the study of a disease than to store data. But if the argument were made that each grant(ee) would be more efficient and could apply more money towards the actual problem, this might convince them. For that we would need a reasonable consensus on what we want and why. More power to John H. and "The Committee".

Thanks to complete "silence" on the BB today I am finally caught up reading!

Mark van der Woerd
 



----------
From: Deacon, Ashley M.


All,



We have been following the CCP4BB discussion with interest. As has been mentioned on several occasions,
the JCSG has maintained, for several years now, an open archive of all diffraction datasets associated with
our deposited structures. Overall this has been a highly positive experience and many developers, researchers,
teachers and students have benefited from our archive. We currently have close to 100 registered users of our
archive, and we seem to receive a new batch of users each time our archive is acknowledged in a paper or is
mentioned at a conference. Building on this initial success, we are currently extending our archive to include
unsolved datasets, which will help us more readily share data and collaborate with methods developers on some
of our less tractable datasets. We are also planning to include screening images for all crystals evaluated as part
of the JCSG pipeline (largely as a feedback tool to help improve crystal quality).



At JCSG, we benefit tremendously from our central database, which already tracks all required metadata associated
with any crystal. Thus I agree with other comments that the cost of such an undertaking should not be underestimated.
The cost of the hardware may be modest; however, people and resources are needed to develop and maintain a robust
and reliable archive.



To date we have not assigned DOIs to our datasets, but we certainly feel this would be of value going forward and are
currently considering this option for our revised archive, which is currently in development.



If successful, this may form a good prototype system, which could be opened up to a broader community outside
of JCSG.



We (JCSG) have already shared much of our experience with the IUCr working group, and we would be happy to
participate and contribute to any ongoing efforts.



Sincerely,
Ashley Deacon

JCSG Structure Determination Core Leader

