Tuesday 22 November 2011

To archive or not to archive, that's the question!

From: Gerard DVD Kleywegt
Date: 28 October 2011 22:02



Hi all,

It appears that during my time here at Cold Spring Harbor, I have missed a small debate on CCP4BB (in which my name has been used in vain to boot).

I have not yet had time to read all the contributions, but would like to make a few points that hopefully contribute to the discussion and keep it with two feet on Earth (as opposed to La La Land where the people live who think that image archiving can be done on a shoestring budget... more about this in a bit).

Note: all of this is on personal title, i.e. not official wwPDB gospel. Oh, and sorry for the new subject line, but this way I can track the replies more easily.

It seems to me that there are a number of issues that need to be separated:

(1) the case for/against storing raw data
(2) implementation and resources
(3) funding
(4) location

I will say a few things about each of these issues in turn:

-----------

(1) Arguments in favour and against the concept of storing raw image data, as well as possible alternative solutions that could address some of the issues at lower cost or complexity.

I realise that my views carry a weight=1.0 just like everybody else's, and many of the arguments and counter-arguments have already been made, so I will not add to these at this stage.

-----------

(2) Implementation details and required resources.

If the community should decide that archiving raw data would be scientifically useful, then it has to decide how best to do it. This will determine the level of resources required to do it. Questions include:

- what should be archived? (See Jim H's list from (a) to (z) or so.) An initial plan would perhaps aim for the images associated with the data used in the final refinement of deposited structures.

- how much data are we talking about per dataset/structure/year?

- should it be stored close to the source (i.e., responsibility and costs for depositors or synchrotrons) or centrally (i.e., costs for some central resource)? If it is going to be stored centrally, the cost will be substantial. For example, at the EBI (the European Bioinformatics Institute) we have 15 PB of storage. We pay about 1500 GBP (~2300 USD) per TB of storage (not the kind you buy at Dixons or Radio Shack, obviously). For stored data, we have a data-duplication factor of ~8, i.e. every file is stored 8 times (at three data centres, plus back-ups, plus a data-duplication centre, plus unreleased versus public versions of the archive). (Note - this is only for the EBI/PDBe! RCSB and PDBj will have to acquire storage as well.) Moreover, disks have to be housed in a building (not free!), with cooling, security measures, security staff, maintenance staff, electricity (substantial cost!), rental of a 1-10 Gb/s connection, etc. All hardware has a life-cycle of three years (barring failures) and then needs to be replaced (at lower cost, but still not free). (A rough back-of-envelope sketch of these numbers follows at the end of this list.)

- if the data is going to be stored centrally, how will it get there? Using ftp will probably not be feasible.

- if it is not stored centrally, how will long-term data availability be enforced? (Otherwise I could have my data on a public server until my paper comes out in print, and then remove it.)

- what level of annotation will be required? There is no point in having zillions of files lying around if you don't know which structure/crystal/sample they belong to, at what wavelength they were recorded, if they were used in refinement or not, etc.

- an issue that has not been raised yet, I think: who is going to validate that the images actually correspond to the structure factor amplitudes or intensities that were used in the refinement? This means that the data will have to be indexed, integrated, scaled, merged, etc. and finally compared to the deposited Fobs or Iobs. This will have to be done for *10,000 data sets a year*... And I can already imagine the arguments that will follow between depositors and "re-processors" about what software to use, what resolution cut-off, what outlier-rejection criteria, etc. How will conflicts and discrepancies be resolved? This could well end up taking a day of working time per data set, i.e. with 200 working days per year, one would need 50 *new* staff for this task alone. For comparison: worldwide, there is currently a *total* of ~25 annotators working for the wwPDB partners...
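To put rough numbers on the storage and staffing points above, here is a minimal back-of-envelope sketch in Python. The per-dataset image volume is an invented, illustrative assumption (it is not stated anywhere in this thread); the other figures are the ones quoted in this list (GBP 1500 per TB, a duplication factor of ~8, ~10,000 depositions per year, about a day of checking per dataset, 200 working days per year).

# Back-of-envelope sketch of the figures quoted in the list above.
# ASSUMED_DATASET_TB is an invented, illustrative value; everything else
# comes from the numbers mentioned in this message.
COST_PER_TB_GBP = 1500        # quoted EBI storage cost per TB
DUPLICATION_FACTOR = 8        # every file is stored ~8 times
DATASETS_PER_YEAR = 10_000    # order of magnitude of new depositions per year
ASSUMED_DATASET_TB = 0.01     # assume ~10 GB of images per deposited dataset

storage_tb = DATASETS_PER_YEAR * ASSUMED_DATASET_TB * DUPLICATION_FACTOR
print(f"Extra storage per year: {storage_tb:.0f} TB, "
      f"~{storage_tb * COST_PER_TB_GBP:,.0f} GBP at one wwPDB site alone")

DAYS_PER_DATASET = 1          # checking/reprocessing time per dataset
WORKING_DAYS_PER_YEAR = 200
print(f"Validation staff needed: "
      f"{DATASETS_PER_YEAR * DAYS_PER_DATASET / WORKING_DAYS_PER_YEAR:.0f} FTE "
      f"(versus ~25 wwPDB annotators worldwide today)")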

Not many of you know that (about 10 years ago) I spent probably an entire year of my life sorting out the mess that was the PDB structure factor files pre-EDS... We were apparently the first people to ever look at the tens of thousands of structure factor files and try to use all of them to calculate maps for the EDS server. (If there were others who attempted this before us, they had probably run away screaming.) This went well for many files, but there were many, many files that had problems. There were dozens of different kinds of issues: non-CIF files, CIF files with wrong headers, Is instead of Fs, Fcalc instead of Fobs, all "h" equal to 0, non-space-separated columns, etc. For a list, see: http://eds.bmc.uu.se/eds/eds_help.html#PROBLEMS

Anyway, my point is that simply having images without annotation and without reprocessing is like having a crystallographic kitchen sink (or bit bucket) which will turn out to be 50% useless when the day comes that somebody wants to do archive-wide analysis/reprocessing/rerefinement etc. And if the point is to "catch cheaters" (which in my opinion is one of the weakest, least-fundable arguments for storage), then the whole operation is in fact pointless without reprocessing by a "third party" at deposition time.

-----------

(3) Funding.

This is one issue we can't really debate - ultimately, it is the funding agencies who have to be convinced that the cost/benefit ratio is low enough. The community will somehow have to come up with a stable, long-term funding model. The outcome of (2) should enable one to estimate the initial investment cost plus the variable cost per year. Funding could be done in different ways:

- centrally - e.g., a big application for funding from NIH or EU

- by charging depositors (just like they are charged Open Access charges, which can often be reclaimed from the funding agencies) - would you be willing to pay, say, 5000 USD per dataset to secure "perpetual" storage?

- by charging users (i.e., Gerard Bricogne :-) - just kidding!

Of course, if the consensus is to go for decentralised storage and a DOI-like identifier system, there will be no need for a central archive, and the identifiers could be captured upon deposition in the PDB. (We could also check once a week if the files still exist where they are supposed to be.)
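As an illustration of how lightweight such a weekly availability check could be, here is a minimal sketch; the example DOIs and the use of the doi.org resolver are assumptions made for the illustration, not part of any existing wwPDB mechanism.

# Minimal sketch of the weekly "do the files still exist?" check mentioned above.
# The example DOIs below are hypothetical.
import urllib.error
import urllib.request

DOI_RESOLVER = "https://doi.org/"

def dataset_still_available(doi: str, timeout: int = 30) -> bool:
    """Return True if the DOI still resolves to a reachable landing page."""
    request = urllib.request.Request(DOI_RESOLVER + doi, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    # Hypothetical raw-data DOIs captured at deposition time.
    for doi in ("10.0000/example-raw-dataset-1", "10.0000/example-raw-dataset-2"):
        print(doi, "ok" if dataset_still_available(doi) else "MISSING")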

-----------

(4) Location.

If the consensus is to have decentralised storage, the solution is quite simple and very cheap in terms of "centralised" cost - wwPDB can capture DOI-like identifiers upon deposition and make them searchable.

If central storage is needed, then there has to be an institution willing and able to take on this task. The current wwPDB partners are looking at future funding that is at best flat, with increasing numbers of depositions that also get bigger and more complex. There is *no way on earth* that wwPDB can accept raw data (be it X-ray, NMR or EM! this is not an exclusive X-ray issue) without *at least* double the current level of funding (and not just in the US for RCSB, but also in Japan for PDBj and in Europe for PDBe)! I am pretty confident that this is simply *not* going to happen.

[Besides, in my own humble opinion, in order to remain relevant (and fundable!) in the biomedical world, the PDB will have to restyle itself as a biomedical resource instead of a crystallographic archive. We must take the structures to the biologists, and we must expand in breadth of coverage to include emerging hybrid methods that are relevant for structural cell (as opposed to molecular) biology. This mission will be much easier to fund on three continents than archiving TBs of raw data that have little or no tangible (i.e., fundable) impact on our quest to find a cure for various kinds of cancer (or hair loss) or to feed a growing population.]

However, there may be a more realistic solution. The role model could be NMR, which has its own global resource for data storage in the BMRB. BMRB is a wwPDB partner - if you deposit an NMR model with us, we take your ensemble coordinates, metadata, restraints and chemical shifts - any other NMR data (including spectra and FIDs) can subsequently be deposited with BMRB. These data will get their own BMRB ID which can be linked to the PDB ID.

A model like this has advantages - it could be housed in a single place, run by X-ray experts (just as BMRB is co-located with NMRFAM, the national NMR facility at Madison), and there would be only one place that would need to secure the funding (which would be substantially larger than the estimate of $1000 per year suggested by a previous poster from La La Land). This could for instance be a synchrotron (linked to INSTRUCT?), or perhaps one of the emerging nations could be enticed to take on this challenging task. I would expect that such a centre would be closely affiliated with the wwPDB organisation, or become a member just like BMRB. A similar model could also be employed for archiving raw EM image data.

-----------

I've said enough for today. It's almost time for the booze-up that kicks off the PDB40 symposium here at CSHL! Heck, some of you who read this might be here as well!

Btw - Colin Nave wrote:

"(in increasing order of influence/power do we have the Pope, US president, the Bond Market and finally Gerard K?)"

I'm a tad disappointed to be only in fourth place, Colin! What has the Pope ever done for crystallography?

--Gerard

******************************************************************
                          Gerard J. Kleywegt

     http://xray.bmc.uu.se/gerard  
******************************************************************
  The opinions in this message are fictional.  Any similarity
  to actual opinions, living or dead, is purely coincidental.
******************************************************************
  Little known gastromathematical curiosity: let "z" be the
  radius and "a" the thickness of a pizza. Then the volume
           of that pizza is equal to pi*z*z*a !
******************************************************************

----------
From: Colin Nave


Gerard
I said in INCREASING order of influence/power i.e. you are in first place.

The joke comes from
" I used to think if there was reincarnation, I wanted to come back as the President or the Pope or a .400 baseball hitter. But now I want to come back as the bond market. You can intimidate everyone.
--James Carville, Clinton campaign strategist"

Thanks for the comprehensive reply
 Regards
  Colin

----------
From: Gerard DVD Kleywegt


Ooohhhh! *Now* it makes sense! :-)

--Gerard
Best wishes,

----------
From: Ethan Merritt


  http://covers.openlibrary.org/b/id/5923051-L.jpg

--
Ethan A Merritt
Biomolecular Structure Center,  K-428 Health Sciences Bldg
University of Washington, Seattle 98195-7742

----------
From: Gerard DVD Kleywegt


Fock'n'Pope! Great find, Ethan! So maybe he deserves fourth place after all.

----------
From: Gerard Bricogne


Dear Gerard,

    I think that a major achievement of this online debate will have been
to actually get you to carry out a constructive analysis (an impressive one,
I will be the first to say) of this question, instead of dismissing it right
away. It is almost as great an achievement as getting the Pope to undergo
psychoanalysis! (I am thinking here of the movie "Habemus Papam".)

    It is very useful to have the facts and figures you mention for the
costs of full PDB officialdom for the storage of raw data. I think one could
describe the first stage towards that, in the form I have been mentioning as
the "IUCr DDDWG pilot project", as first trying to see how to stop those raw
images from disappearing, pending the mobilisation of more resources towards
eventually putting them up in five-star accommodation (if they are thought
to be earning their keep). I am again hopeful that anticipated difficulties
at the five-star stage (with today's cost estimates) will not stop us from
trying to do what is possible today in this pilot project, and I also hope
that enough synchrotrons and depositors will volunteer to take part in it.

    The extra logistical load on checking that submitted raw images sets do
correspond to the deposited structure should be something that can be pushed
down towards the synchrotron sources, as was mentioned for the proper
book-keeping of "metadata", as part of keeping tidy records linking user
project databases to datasets, and towards enhancements in data processing
and structure determination pipelines to keep track of all stages of the
derivation of the deposited results from the raw data. Not trivial, but not
insuperable, and fully in the direction of more automation and more
associated record keeping. This is just to say that it needs not all land on
the PDB's shoulders in an initially amorphous state.


    In any case, thank you for devoting so much time and attention to this
nuts-and-bolts discussion when there are so many tempting forms of high
octane entertainment around!


    With best wishes,

       Gerard (B.)

--

    ===============================================================
    *                                                             *
    * Gerard Bricogne                                             *
    *                                                             *
    * Global Phasing Ltd.                                         *
    * Sheraton House, Castle Park                                 *
    * Cambridge CB3 0AX, UK                                       *
    *                                                             *
    ===============================================================

----------
From: Herbert J. Bernstein

As the poster who mentioned the $1000 - $3000 per terabyte per year
figure, I should point out that the figure originated not from "La La
land" but from an NSF RDLM workshop in Princeton last summer.  Certainly
the actual costs may be higher or lower depending on economies or
diseconomies of scale and on the ancillary tasks required.  The base
figure itself seems consistent with the GBP 1500 figure cited for EBI.

That aside, the list presented seems very useful to the discussion.
I would suggest adding to it the need to try to resolve the
complex intellectual property issues involved.  This might be
a good time to try to get a consensus of the scientific community
of what approach to IP law would best serve our interests going
forward.  The current situation seems a bit messy.

Regards,
 Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769
=====================================================

----------
From: Jrh


Dear Gerard K,
Many thanks indeed for this.
Like Gerard Bricogne, you also indicate that the decentralised location option is 'quite simple and very cheap in terms of centralised cost'. I hope the SR facilities worldwide can follow the lead taken by Diamond Light Source and PaN, the European Consortium of SR and Neutron Facilities: keep their data archives and also assist authors with the DOI registration process for those datasets that result in publication. Linking to these DOIs from the PDB is, as you confirm, straightforward.

Gerard B's pressing of the above approach via the 'pilot project' within the IUCr DDD WG discussions, with a nicely detailed plan, brought home to me the merit of this approach for the even greater challenge of raw-data archiving in chemical crystallography, both in terms of the number of datasets and because the SR facilities play a much smaller role there. IUCr Journals also note the challenge of moving large quantities of data around, i.e. if the Journals were to try to host everything for chemical crystallography they would thus become 'the centre' for these datasets.

So: universities are now establishing their own institutional repositories, driven largely by the Open Access demands of funders. For these to host the raw datasets that underpin publications is a reasonable role in my view, and they already have this category in the University of Manchester eScholar system, for example. I am set to explore locally whether they would accommodate all of our lab's raw X-ray image datasets per annum that underpin our published crystal structures.

It would be helpful if readers of this CCP4bb could kindly also explore with their own universities whether they have such an institutional repository and whether raw datasets could be accommodated. Please email me off-list with this information if you prefer, but within the CCP4bb is also good.

Such an approach involving institutional repositories would of course also work for the 25% of MX structures that are based on non-SR datasets.

All the best for a splendid PDB40 Event.

Greetings,
John
Prof John R Helliwell DSc

----------
From: Herbert J. Bernstein

One important issue to address is how to deal with the perceived
reliability issues of the federated model, and how to approach
the higher reliability of the centralized model described by
Gerard K without incurring what seem at present to be
unacceptable costs.  One answer comes from the approach followed
in communications systems.  If the probability of data loss in
each communication subsystem is, say, 1/1000, then the
probability of data loss in two independent copies of the same
lossy system is only 1/1,000,000.  We could apply that lesson to
the federated data image archive model by asking each
institution to partner with a second independent, and hopefully
geographically distant, institution, with an agreement for each
to host copies of the other's images.  If we restrict that
duplication protocol, at least at first, to those images
strongly related to an actual publication/PDB deposition, the
incremental cost of greatly improved reliability would be very
low, with no disruption of the basic federated approach being
suggested.
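In other words, independent failures multiply, so the combined loss probability drops quadratically; a tiny check of the arithmetic (the 1/1000 figure is the illustrative value used above):

p_loss_single_site = 1 / 1000                  # illustrative figure from above
p_loss_both_copies = p_loss_single_site ** 2   # two independent repositories
print(p_loss_both_copies)                      # 1e-06, i.e. 1 in 1,000,000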

Please note that I am not suggesting that institutional repositories
will have 1/1000 data loss rates, but they will certainly have some
data loss rate, and this modest change in the proposal would help to
greatly lower the impact of that data loss rate and allow us to go
forward with greater confidence.

Regards,
 Herbert
--
    Dowling College, Brookhaven Campus, B111B
  1300 William Floyd Parkway, Shirley, NY, 11967

              
=====================================================

----------
From: Jrh


Dear Herbert,
I imagine it likely that, e.g., the Univ Manchester eScholar system will have duplicate storage in place for the reasons you outline below. However, for it to be geographically distant is, to my reckoning, less likely, though still possible. I will add that further query to my first query to eScholar user support re dataset sizes and DOI registration.
Greetings,
John
Prof John R Helliwell DSc




----------
From: Herbert J. Bernstein


Dear John,

 Most sound institutional data repositories use some form of
off-site backup.  However, not all of them do, and the
standards of reliability vary.  The advantages of an explicit
partnering system are both practical and psychological.  The
practical part is the major improvement in reliability --
even if we start at 6 nines, 12 nines is better.  The
psychological part is that members of the community can
feel reassured that reliability has indeed been improved to
levels at which they can focus on other, more scientific
issues, instead of the question of reliability.

 Regards,
   Herbert

----------
From: Kay Diederichs


At 20:59, Jrh wrote: [...]

Dear John,

I'm pretty sure that there is no consistent policy for providing an "institutional repository" for the deposition of scientific data at German universities, Max-Planck institutes or Helmholtz institutions; at least, I have never heard of anything like this. More specifically, our University of Konstanz certainly does not have the infrastructure to provide this.

I don't think Germany is the only country that is an exception to any rule of availability of "institutional repositories". Rather, I'm almost amazed that British and American institutions seem to support this.

Thus I suggest not focusing exclusively on official institutional repositories, but exploring alternatives: distributed filestores like Google's BigTable, BitTorrent or others might be just as suitable - check out http://en.wikipedia.org/wiki/Distributed_data_store. I guess that any crystallographic lab could easily sacrifice/donate a TB of storage for the purposes of this project in 2011 (and maybe 2 TB in 2012, 3 in 2013, ...), but clearly the level of work to set this up should be kept as low as possible (a BitTorrent daemon seems simple enough).
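Whatever distributed store were chosen, each lab would presumably want to publish a checksum manifest alongside its image sets so that any scattered copy can be verified later. Purely as an illustration (nothing in this message prescribes such a scheme), here is a minimal sketch; the directory name, file layout and manifest name are invented for the example.

# Minimal sketch: build a SHA-256 manifest for a directory of diffraction
# images before handing it to any distributed store (BitTorrent, a partner
# site's mirror, ...). Directory and file names are illustrative.
import hashlib
import json
from pathlib import Path

def build_manifest(dataset_dir: str, manifest_name: str = "manifest.json") -> dict:
    """Hash every file under dataset_dir and write the result as JSON."""
    root = Path(dataset_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    (root / manifest_name).write_text(json.dumps(manifest, indent=2))
    return manifest

if __name__ == "__main__":
    # Hypothetical dataset directory, e.g. frame_0001.cbf ... frame_0720.cbf
    build_manifest("lysozyme_2011-10-28")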

Just my 2 cents,

Kay





----------
From: Anastassis Perrakis


Dear all,

The discussion about keeping primary data, and what level of data can be considered 'primary', has - rather unsurprisingly - come up also in areas other than structural biology.
An example is next-generation sequencing. A full dataset is a few terabytes, but post-processing reduces it to sub-Gb size. However, the post-processed data, as in our case, have suffered from the inadequacy of computational "reduction"... Our institute has therefore decided to keep the primary material backed up in triplicate: our facility bought three -80 freezers, one on site in the basement, one on the top floor, and one off-site, and they hold the DNA to be sequenced. A sequencing run is already sub-1k$ and it will not become more expensive. So, if it's important, do it again. It's cheaper and it's better.

At first sight, that does not apply to MX. Or does it? 

So, maybe the question is not "To archive or not to archive" but "What to archive".

(similarly, it never crossed my mind if I should "be or not be" - I always wondered "what to be")

A.


Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
Department of Biochemistry (B8)
Netherlands Cancer Institute, 


----------
From: Gerard Bricogne


Dear Tassos,

    It is unclear whether this thread will be able to resolve your deep
existential concerns about "what to be", but you do introduce a couple of
interesting points: (1) raw data archiving in areas (of biology) other than
structural biology, and (2) archiving the samples rather than the verbose
data that may have been extracted from them.

    Concerning (1), I am grateful to Peter Keller here in my group for
pointing me towards the Trace Archive of DNA sequences in mid-August, when
we were reviewing for the n-th time the issue of raw data deposition under
discussion in this thread, and its advantages over keeping only the derived
data extracted from the raw images. He found an example, at

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?&cmd=retrieve&val=12345&dopt=trace&size=1&retrieve=Submit

You can check the "Quality Score" box below the trace, and this will refresh
the display to give a visual estimate of the reliability of the sequence.
There is clearly a problem around position 210, that would not have been
adequately dealt with by just retaining the most probable sequence. In this
context, it has been found worthwhile to preserve the raw data, to make it
possible to "audit" derived data against them. This is at least a very
simple example of what you were referring to when you wrote about the
inadequacy of computational "reduction". In the MX context, this is rather
similar to the contamination of integrated intensities by spots from
parasitic lattices (which would still affect unmerged intensities, by the
way - so upgrading the pdb "structure factor" file to unmerged data would
take care of over-merging, but not of that contamination).

    Concerning (2) I greatly doubt there would be an equivalent for MX: few
people would have spare crystals to put to one side for a future repeat of a
diffraction experiment (except in the case of lysozyme/insulin/thaumatin!).
I can remember an esteemed colleague arguing 4-5 years ago that if you want
to improve a deposited structure, you could simply repeat the work from
scratch - a sensible position from the philosophical point of view (science
being the art of the repeatable), but far less sensible in conditions of
limited resources, and given also the difficulties of reproducing crystals.
The real-life situation is more a "Carpe diem" one: archive what you have,
as you may never see it again! Otherwise one would easily get drawn into the
same kind of unrealistic expectations as people who get themselves frozen in
liquid N2, with their blood replaced by DMSO, hoping to be brought back to
life some day in the future ;-) .


    With best wishes,

         Gerard.

--

----------
From: Martin Kollmar


Still, after hundreds (?) of emails on this topic, I haven't seen any convincing argument in favor of archiving data. The only convincing arguments are against, and they come from Gerard K and Tassos.

Why?
The question is not what to archive, but still why we should archive all the data in the first place.

Because software developers need more data? Should we undertake all these efforts and costs because 10 developers worldwide need the data for ALL protein structures? Do they really need so much data - wouldn't it be enough to build a repository of maybe 1000 datasets for development work?

Does anyone really believe that our view of the actual problem, the function of the proteins, changes with the analysis of whatever scattering is still in the images but not used by today's software? Crystal structures are static snapshots, obtained under artificial conditions. In solution (still the physiological state) they might look different - not much, but at least far more dynamic. Does it therefore matter whether we know some side-chain positions better (in the crystal structure) when re-analysing the data? In turn, are our current software programs so bad that we would expect strong differences when re-analysing the data? No. And if the structures change upon re-analysis (more or less), who re-interprets the structures and re-writes the papers?

There are many, many cases where researchers re-did structures (or did structures closely related to already available ones, like mutants or structures from closely related species), also after 10 years. I guess they used the latest software in each case, and thus incorporated all the software development of those 10 years. And are the structures really different (beyond the introduced changes, mutations, etc.)? Different because of the software used?

The comparison with next-generation sequencing data is useful here, but only in the sense Tassos explained. Well, of course not every position in the genomic sequence is fixed. Therefore it is sometimes useful to look at the original data (the traces, as Gerard B pointed out). But we already know that every single organism is different (especially eukaryotes), and therefore it is absolutely enough to store the computationally reduced and merged data. If one needs better, position-specific data, sequencing and comparing single species becomes necessary, as in the ENCODE project, the sequencing of about 100 Saccharomyces strains, the sequencing of 1000 Arabidopsis strains, etc. Discussions about single positions are useless if they are not statistically relevant; they need to be analysed in the context of populations, large cohorts of patients, etc. If we need personalized medicine adapted to personal genomes, we would also need personal sets of protein structures, which we cannot provide yet. Therefore, storing the DNA in the freezer is better and cheaper than storing all the raw sequencing data. Do you think a reviewer re-sequences, re-assembles or re-annotates a genome, even if access to the raw reads were available? If we trust those data, why don't we trust our structure factors? Do you trust electron microscopy images, or movies of GFP-tagged proteins? Do you think what is presented for a single cell or a few visible cells is also found in all cells?

And now, how many of you (if not everybody) use structures from yeast, Drosophila, mouse, etc. as MODELS for human proteins? If we stick to this thinking, who would care about potential minor changes in the structures upon re-analysis (and, in the light of this discussion, arguing about specific genomic sequence positions becomes unimportant as well)?

Is any of the archived data useful without manual evaluation upon archiving? This is especially relevant for structures not solved yet. Do the images belong to the structure factors? If only images are available, where is the corresponding protein sequence, has it been sequenced, what was in the buffer/crystallization condition, what was used during protein purification, what was the intention behind the crystallization - e.g. a certain functional state that the protein was forced into by artificial conditions - etc., etc.? Who wants to evaluate all that, and how? The question is not whether we could do it. We could do it, but wouldn't it advance science far more if we spent the time and money on new projects rather than on evaluation, administration, etc.?

Be honest: how many of you have really, and completely, re-analysed your own data, deposited 10 years ago, with the latest software? What changes did you find? Did you have to re-write the discussions in your earlier publications? Do you think the changes justify the effort and cost of worldwide archiving of all data?

Well, in all of this there are always single cases (and they have been mentioned in earlier emails) where these things matter or mattered. But does that really justify all the future effort and cost needed to archive the exponentially (!) increasing amount of data? Do we need all this effort for better statistics tables? Do you believe the standard lab biologist will look into the images at all? Is the effort just for us crystallographers? As long as just a few dozen users would re-analyse the data, it is not worth it.

I like question marks, and maybe someone can give me an argument for archiving images. At the moment I would vote for not archiving.

With best regards,

Martin


P.S. For the next-gen sequencing data, they have found a new way of transferring the data, called VAN (the newbies might google for it) in analogy to the old-fashioned and slow LAN and WLAN. Maybe we will also adopt to this when archiving our data?

--
Priv. Doz. Dr. Martin Kollmar

Max-Planck-Institute for Biophysical Chemistry
Group Systems Biology of Motor Proteins
Department NMR-based Structural Biology
Am Fassberg 11
37077 Goettingen
Deutschland



www.motorprotein.de (Homepage)
www.cymobase.org (Database of Cytoskeletal and Motor Proteins)
www.diark.org (diArk - a resource for eukaryotic genome research)
www.webscipio.org (Scipio - eukaryotic gene identification)

----------
From: Robert Esnouf


Dear All,

As someone who recently left crystallography for sequencing, I
should modify Tassos's point...

"A full data-set is a few terabytes, but post-processing
My experience from HiSeqs is that this "full" here means the
base calls - equivalent to the unmerged HKLs - hardly raw
data. NGS (short-read) sequencing is an imaging technique and
the images are more like >100TB for a 15-day run on a single
flow cell. The raw base calls are about 5TB. The compressed,
mapped data (BAM file, for a human genome, 30x coverage) is
about 120GB. It is only a variant call file (VCF, difference
from a stated human reference genome) that is sub-Gb and these
files are - unsurprisingly - unsuited to detailed statistical
analysis. Also $1k is a not yet an economic cost...

The DNA information capacity in a single human body dwarfs the
entire world disk capacity, so storing DNA is a no brainer
here. Sequencing groups are making very hard-nosed economic
decisions about what to store - indeed it is a source of
research in itself - but the scale of the problem is very much
bigger.

My tuppence ha'penny is that depositing "raw" images along
with everything else in the PDB is a nice idea but would have
little impact on science (human/animal/plant health or
understanding of biology).

1) If confined to structures in the PDB, the images would just
be the ones giving the final best data - hence the ones least
likely to have been problematic. I'd be more interested in
SFs/maps for looking at ligand-binding etc...

2) Unless this were done before paper acceptance they would be
of little use to referees seeking to review important
structural papers. I'd like to see PDB validation reports
(which could include automated data processing, perhaps culled
from synchrotron sites, SFs and/or maps) made available to
referees in advance of publication. This would be enabled by
deposition, but could be achieved in other ways.

3) The datasets of interest to methods developers are unlikely
to be the ones deposited. They should be in contact with
synchrotron archives directly. Processing multiple lattices is
a case in point here.

4) Remember the "average consumer" of a PDB file is not a
crystallographer. More likely to be a graduate student in a
clinical lab. For him/her things like occupancies and B-
factors are far more serious concerns... I'm not trivializing
the issue, but importance is always relative. Are there
"outsiders" on the panel to keep perspective?

Robert


--

Dr. Robert Esnouf,
University Research Lecturer, ex-crystallographer
and Head of Research Computing,
Wellcome Trust Centre for Human Genetics,
Roosevelt Drive, Oxford OX3 7BN, UK




----------
From: Oganesyan, Vaheh


 
I was hesitant to add my opinion so far because I'm more used to listening to this forum than to telling others what I think.
"Why" and "what" to deposit are absolutely interconnected. Once you decide why you want to do it, then you will probably know what the best format will be, and vice versa.
Whether this deposition of raw images will or will not help future understanding of the biology, I'm not sure.
But storing those difficult datasets to help future software development sounds rather far-fetched. This assumes that in the future crystallographers will never grow crystals that deliver difficult datasets. If that is the case, and in 10-20-30 years the next generation will be growing much better crystals, then they won't need such software development.
If that is not the case, and once in a while (or more often) they will be getting something out of the ordinary, then software developers will take those datasets and develop whatever they need to handle such cases.
 
Am I missing a point of discussion here?
 
Regards,
 
     Vaheh 



----------
From: David Waterman


I have no doubt there are software developers out there who have spent years building up their own personal collections of 'interesting' datasets, file formats, and various oddities that they take with them wherever they go, and consider this collection to be precious. Despite the fact that many bad datasets are collected daily at beamlines the world over, it is amazing how difficult it can be to find what you want when there is no open, single point-of-access repository to search. Simply asking the crystallographers and beamline scientists doesn't work: they are too busy doing their own jobs.

-- David

----------
From: Clemens Vonrhein


Dear Vaheh,
As far as I see the general plan, that would be a second stage (deposit
all datasets) - the first one would be the datasets related directly
to a given PDB entry.
Oh sure they will. And lots of those datasets will be available to
developers ... being thrown a difficult problem under pressure is a
very good thing to get ideas, think out of the box etc. However,
developing solid algorithms is better done in a less hectic
environment with a large collection of similar problems (changing only
one parameter at a time) to test a new method.
They'll grow better crystals for the type of project we're currently
struggling with, sure. But we'll still get poor crystals for projects we
don't even attempt or tackle right now.

Software development is a slow process, often working on a different
timescale than the typical structure solution project (obviously there
are exceptions). So planning ahead for that time will prepare us.

And yes, it will have an impact on the biology then. It's not just the
here and now (and next grant, next high-profile paper) we should be
thinking about.
One small point maybe: there are very few developers out there - but a
very large number of users who benefit from what they have
done. Often the work is not very visible ("It's just pressing a button
or two ... so it must be trivial!") - which is a good thing: it has to
be simple, robust, automatic and usable.

I think if a large enough number of developers consider depositing
images a very useful resource for their future development (and
therefore future benefit to a large number of users), it should be
seriously considered, even if some of the advertised benefits have to
be taken on trust.

Past developments in data processing have had a big impact on a lot of
projects - high-profile or just the standard PhD-student nightmare -
with often small return for the developers in terms of publications,
grants or even citations (main paper or supplementary material).

So maybe in the spirit of the festive season it is time to consider
giving a little bit back? What is there to lose? Another 20 minutes of
additional deposition work for the user in return for maybe/hopefully
saving a whole project 5 years down the line? Not a bad investment, it
seems to me ...

Cheers

Clemens

--

***************************************************************
* Clemens Vonrhein, Ph.D.     vonrhein AT GlobalPhasing DOT com
*  Sheraton House, Castle Park
*--------------------------------------------------------------
* BUSTER Development Group      (http://www.globalphasing.com)
***************************************************************

----------
From: Martin Kollmar

The point is that science is not stamp collecting. Therefore the first question should always be "Why". If you start with "What", the discussion immediately switches to technical issues like how many TB or PB, how many $/€, how much manpower. And all that intense discussion can be blown away by a single "Why". Nothing is for free. But if it would help science and mankind, nobody would hesitate to spend millions of $/€.

Supporting software development / software developers is a different question. If that had been the first question asked, the answer would never have been "archive all datasets worldwide for deposited structures", but rather: how could we, the community, build up a resource covering different kinds of problems (e.g. space groups, twinning, overlapping lattices, etc.)?

I still haven't got an answer to "Why".

Best regards,
Martin



On 31.10.2011 16:18, Oganesyan, Vaheh wrote:

----------
From: Gerard Bricogne

Dear Martin,

    Thank you for this very clear message about your views on this topic.
There is nothing like well articulated dissenting views to force a real
assessment of the initial arguments, and you have certainly provided that.

    As your presentation is "modular", I will interleave my comments with
your text, if you don't mind.

--
    A first impression is that your remark rather looks down on those "10
developers worldwide", a view not out of keeping with that of structural
biologists who have moved away from ground-level crystallography and view
the latter as a "mature technique" - a euphemism for saying that no further
improvements are likely nor even necessary. As Clemens Vonrhein has just
written, it may be the very success of those developers that has given the
benefit of what software can do to users who don't have the faintest idea of
what it does, nor of how it does it, nor of what its limitations are and how
to overcome those limitations - and therefore take it for granted.

    Another side of the "mature technique" kiss of death is the underlying
assumption that the demands placed on crystallographic methods are
themselves static, and nothing could be more misleading. We get caught time
and again by rushed shifts in technology without proper precautions in case
the first adaptations of the old methods do not perform as well as they
might later. Let me quote an example: 3x3 CCD detectors. It was too quickly
and hurriedly assumed that, after correcting the images recorded on these
instruments for geometric distortions and flat-field response, one would get
images that could be processed as if they came from image plates (or film).
This turned out to be a mistake: "corner effects" were later diagnosed,
which were partially correctable by a position-dependent modulation factor,
applied for instance by XDS in response to the problem. Unfortunately, that
correction is not simply detector-dependent and applicable to all datasets
recorded on a given detector, as it is related to a spatial variation in
the point-spread function - so you really need to reprocess each set of
images to determine the necessary corrections. The tragic thing is that,
for a typical resolution limit and detector distance, these corners cut
into the quick of your strongest secondary-structure-defining data. If you
have kept your images, you can try to recover from that; otherwise, you
are stuck with what can be seriously sub-optimal data. Imagine what this
can do to SAD anomalous differences when Bijvoet pairs fall on detector
positions where these corner effects are vastly different...

    Another example is that of the recent use of numerous microcrystals,
each giving a very small amount of data, to assemble datasets for solving
GPCR structures. The methods for doing this - for getting the indexing and
integration of such thin slices of data, and getting the overall scaling
to behave - are still very rough. It would be pure insanity to throw these
images away and not to count on better algorithms coming along to improve
the final data extractable from them.

--
    I think that, rather than asking rhetorical questions about people's
beliefs regarding such a general question, one needs testimonies about real
life situations. We have helped a great many academic groups in the last 15
years: in every case, they ended up feeling really overjoyed that they had
kept their images when they had, and immensely regretful when they hadn't.
I noticed, for example, that your last PDB entry, 1LKX (2002) does not have
structure factor data associated with it. It is therefore impossible for
anyone to do anything about its 248 REMARK 500 records complaining about bad
(PHI,PSI) values; whereas if the structure factors had been deposited, all
our own experience in this area suggests that today's refinement programs
would have helped a great deal towards this.

    Otherwise you invoke the classical arguments about the possible
artificiality of crystal structures because they are "static", etc. Even
if this is the case, it does not diminish the usefulness of characterising
what they enable us to see with the maximum possible care and precision.
The "dynamic" aspect of NMR structure ensembles can hide a multitude of
factors that are inaccuracies rather than a precise characterisation of
dynamics. Shall I dare mention that a favourite reverse acronym for NMR is
"Needs More Resolution"? (Sorry, crystallographer's joke ;-) ...).

    Finally, it isn't because no one would have the time to write a paper
after having corrected or improved a PDB entry that he/she would not have
the benefit of those corrections or improvements when using that entry for
modelling or for molecular replacement.

--
    Again this is a very broad question, to which answers would constitute
a large and varied sample. To use an absence of organised detailed evidence
to justify not doing something is not the best kind of argument.

--
    I think we are straying into facts and figures completely unrelated
to the initial topic. Characteristically they come from areas in which
fuzziness is rampant - I do not see why this should deter crystallographers
from treasuring the high level of accurate detail reachable by their own
methods in their own area.

--
    There are many ways of advancing science, and perhaps every
specialist's views of this question are biased towards his/her own. We agree
that archiving of images without all the context within which they were
recorded would be futile. Gerard K raised the issue of all the manual work
one might have to contemplate if the PDB staff were to manually check that
the images do belong where they are supposed to. I think this is a false
problem. We need to get the synchrotron beamlines to do a better, more
consistent, more standardised job of keeping interconnected records linking
user projects to sample descriptors to image sets to processing results.
The pharma industry do that successfully: when they file the contents of
their hard disk after a synchrotron trip, they do not rely on extra staff
to check that the images do go with the targets and the ligands, as if
they had just received them in a package delivered in the morning post:
consistency is built into their book-keeping system, which includes the
relevant segment of that process that gets executed at the synchrotron.

--
    OK, good question, but the answer might not be what you expect. It is
the possibility of going back to raw data if some "auditing" of an old
result is required that is the most important. It is like an insurance
policy: would you ask people "How many of you have made calls on your
policies recently" and use the smallness of the proportion of YESs as an
argument for not getting one?

--
    I think that here again you are calling upon the argument of the
"market" for structural results among "standard lab biologist(s)". This is
important of course, and laudable efforts are being made by the PDB to make
its contents more approachable and digestible by that audience. That is a
different question, though, from that of continuing to improve the
standards of quality of crystallographic results produced by the community,
and in particular of the software tools produced by methods developers. On
that side of the divide, different criteria apply from those that matter
the most in the "consumer market" of lab biologists. The shift to
maximum-likelihood methods in phasing and refinement, for instance, did not
take place in response to popular demand from that mass market, if I recall
- and yet it made a qualitative difference to the quality and quantity of
the results they now have at their disposal.

--
    I think that the two examples I gave at the beginning should begin to
answer your "why" question: because each reduced dataset might have fallen
victim to unanticipated shortcomings of the software (and underlying
assumptions) available at the time. I will be hard to convince that one can
anticipate the absence of unanticipated pitfalls of this kind ;-) .


    With best wishes,

         Gerard.



----------
From: Kelly Daughtry


I believe that archiving original images for published data sets could be very useful, if linked to the PDB.
I have downloaded SFs from the PDB to use for re-refinement of a published model (when I think the electron density maps are misinterpreted) and personally had a different interpretation of the density (ion vs small ligand). With that in mind, re-processing from the original images could be useful for catching mistakes in processing (especially if a high R-factor or low I/sigma is reported), albeit a small percentage of the time.

As for difficult data sets, problematic cases, etc., I can see the importance of their availability from the preceding arguments.
It seems to be most useful for software developers. In that case, I would suggest that software developers publicly request our difficult-to-process images, or create their own repository. Then they can store and use the data as they like. I would happily upload a few data sets.
(Just a suggestion)

Best Wishes,
Kelly Daughtry

*******************************************************
Kelly Daughtry, Ph.D.
Post-Doctoral Fellow, Raetz Lab
Biochemistry Department
Duke University
Alex H. Sands, Jr. Building
303 Research Drive
RM 250
Durham, NC 27710
*******************************************************

----------
From: Gerard Bricogne

Dear Martin,

    First of all I would like to say that I regret having made my "remark
500" and apologise if you read it as a personal one - I just saw it as an
example of a dataset it might have been useful to revisit if data had been
available in any form. I am sure that there are many skeletons in many
cupboards, including my own :-) .

    Otherwise, as the discussion does seem to refocus on the very initial
proposal in gestation within the IUCr's DDDWG, i.e. voluntary involvement of
depositors and of synchrotrons, so that questions of logistics and cost
could be answered in the light of empirical evidence, your "Why" question is
the only one unanswered by this proposal, it seems.

    In this respect I wonder how you view the two examples I gave in my
reply to your previous message, namely the "corner effects" problem and the
re-development of methods for collating data from numerous small, poorly
diffracting crystals as was done in the recent solution of GPCR structures.
There remains the example I cited from the beginning, namely the integration
of images displaying several overlapping lattices.

