Wednesday 30 November 2011

X-ray protein crystallography course at the Department of Biochemistry, University of Oulu, Oulu, Finland


Date: 11 November 2011 14:57


This is the second announcement of an X-ray protein crystallography
course in Oulu, Finland, from January 30 to February 3, 2012.
Further details of the programme are given on the WWW site listed
below. The course focuses on data collection, data processing, data
tracking and phasing. The course is sponsored by Biocenter Oulu,
Biocenter Finland and BioStruct-X.

The following teachers have agreed to contribute:
-Manfred Weiss (Berlin, Germany) (General Introduction)
-Rik Wierenga (Oulu, Finland) (General Introduction)
-Harry Powell (Cambridge, UK) (MOSFLM/SCALA)
-Marianna Biadene (Karlsruhe, Germany) (PROTEUM)
-Kay Diederichs (Konstanz, Germany) (XDS)
-George Sheldrick (Göttingen, Germany) (SHELXC/D/E)
-Martin Walsh (Diamond, UK) (on-site and remote data collection at the
Diamond synchrotron)
-Chris Morris  (Daresbury, UK) (PiMS/xtalPiMS)
-Stuart McNicholas (York, UK) (CCP4MG)

It will be possible to collect in-house data. More details are also
available on our WWW-site. Participants are encouraged to bring their
own data and/or their own crystals.

Please contact Vanja Kapetaniou soon for registration, as described on the
WWW site; please include with your registration a CV and a brief
statement of why this course would be beneficial for you.

Rik Wierenga, Vanja Kapetaniou, Kristian Koski
http://www.biochem.oulu.fi/struct/xraycourse/

Tuesday 29 November 2011

crystallization of synthetic peptides

From: H. Raaijmakers
Date: 10 November 2011 15:16


Dear crystallographers,

Because of the low cost and speed of synthesizing 40- to 60-mer peptides,
I wonder whether anyone has experience (good or bad) crystallizing such
peptides. In the literature, I've found up to 34-mer synthetic coiled coils,
but no other protein class. I can imagine that a protein sample with a few
percent "random deletion mutants" mixed into it won't crystallize easily,
but has anyone actually tried?

cheers,

Hans

----------
From: <mjvdwoerd


Hans,

Most natural toxins from snakes, scorpions, etc. are peptides of around 50 residues, and quite a few of those have been studied and crystallized (see the PDB for a list). Having worked on one of these structures as a graduate student, I can share my experience:
- Purification is harder than you would think. You are talking about < 10 kDa, usually around 5 kDa. Many methods (size exclusion, even concentration over a simple membrane) don't work as easily as you would like.
- I did not have much of a problem crystallizing (i.e. no worse than other proteins, maybe even a little easier).
- Crystals tend to diffract well (maybe better than average).
- Structures can be hard to solve; MIR is very difficult because ions tend not to go into such crystals easily (because the molecules are small and tightly packed?); MR is hard because (again) it does not work very well on very small systems.
- Crystallization is not necessarily purification - if you have a mixture of peptides to start with, it may be harder to crystallize, or not: you might get a crystal that is a (random-ish) mixture.
- If you have more than two cysteines in your sequence (natural toxins typically do), there is the additional problem of getting the correct folding and disulphide bridges, and it can be very hard to discriminate between correctly and incorrectly linked disulphides.

Finally:
These sequences should be small enough for NMR. That may or may not answer your questions, but it sidesteps your original one.

Mark


----------
From: George Sheldrick

As Mark says, structure solution of smallish peptides is not usually as easy as one might expect. A number of the small (say up to 50 residue) peptides in the PDB were solved by direct methods, but these require native data to 1.2 A or (preferably) better. If sulfur is present in the molecule, SAD is a good choice and does not require such high resolution, but you need highly redundant data, so a high-symmetry space group helps. And since you are synthesizing the peptides anyway, if even one Met is present in the sequence you can replace it with selenomethionine.

George
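
To get a feel for why highly redundant data are needed for sulfur SAD on such small systems, the expected anomalous signal can be estimated with the usual back-of-the-envelope Bijvoet-ratio formula. A minimal sketch in Python, where the atom count and sulfur content of a hypothetical 50-residue peptide are assumptions chosen only for illustration:

import math

def bijvoet_ratio(n_anom, n_non_h_atoms, f_double_prime, z_eff=6.7):
    """Rough expected Bijvoet ratio <|dF+/-|>/<F>:
    (2 f'' / Z_eff) * sqrt(N_anom / (2 N_non_H_atoms))."""
    return (2.0 * f_double_prime / z_eff) * math.sqrt(n_anom / (2.0 * n_non_h_atoms))

# Hypothetical 50-residue peptide, ~400 non-H atoms, 4 sulfurs,
# f''(S) ~ 0.56 e at Cu K-alpha:
print(f"expected anomalous signal ~ {100 * bijvoet_ratio(4, 400, 0.56):.1f}%")  # ~1.2%

A signal of only one or two percent is easily swamped by measurement error, which is why high multiplicity (and a high-symmetry space group) helps so much.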

----------
From: Joel Tyndall

Some HIV protease structures have been determined using synthetic HIV protease (99-amino-acid monomers). Look at J. Martin et al. from UQ in Queensland. I believe this was done with Steve Kent. The protein contains some non-natural amino acids too.

Hope this helps





Web Seminar: Home Lab SAD phasing with HKL-3000: From data collection to refined models in less than an hour


Date: 10 November 2011 23:35



Dear colleagues,

I would like to draw your attention to an upcoming free, educational webinar to be presented by Jim Pflugrath, Ph. D. titled "Home Lab SAD phasing with HKL-3000: From data collection to refined models in less than an hour." This interactive tutorial and webinar will demonstrate how to use HKL-3000 to process diffraction images, find the anomalous substructure with SHELXD, phase with SHELXC, MLPHARE, and DM, then build with ARP/wARP and refine with REFMAC (Kudos to all the authors of these programs!). Emphasis will be on practical tips and how to interpret the output. Relatively low redundancy diffraction datasets will be used as examples to dispel some of the myths about sulfur and selenium Home Lab SAD phasing.

This webinar is scheduled to occur on Thursday, November 17th 10:00 AM CST (8:00 AM PST / 4:00 PM GMT). You can find more information, including a registration link at: http://www.rigaku.com/protein/webinars.html

Best regards,
Angela


NOTE: You can watch some of our past webinars at: http://www.rigaku.com/protein/webinars-past.html. This list of webinars includes educational topics, such as data processing with d*TREK, mosflm, XDS and HKL as well as topics on diffraction data collection. Also, don't miss the great talks and historical perspectives from industry experts such as Michael Rossmann, Brian Matthews and Ian Wilson. 
 
-- 
Angela R. Criswell, Ph. D.
Rigaku Americas Corporation



How to calculate the percentage of buried hydrophobic surface area in a protein with a known structure

From: Ke, Jiyuan
Date: 10 November 2011 22:10


Dear All,

 

I have a protein that exists as a dimer in the crystal structure. I want to calculate and compare the buried hydrophobic surface area of a monomer with that of the dimer. Does anyone know how to do this? Thanks in advance!

 

Jiyuan Ke, Ph.D.

Research Scientist

Van Andel Research Institute

333 Bostwick Ave NE

Grand Rapids, MI 49503

 


----------
From: Cale Dakwar



Jiyuan,

I believe PISA will easily do this for you.

C
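
For those who prefer to script it, the buried area is simply SASA(A) + SASA(B) - SASA(AB). A minimal sketch using the freesasa Python bindings (the file names are hypothetical, and the 'Polar'/'Apolar' class labels may differ between freesasa versions):

import freesasa

def areas(pdb_file):
    """Total and apolar solvent-accessible surface area (A^2) of one PDB file."""
    structure = freesasa.Structure(pdb_file)
    result = freesasa.calc(structure)
    classes = freesasa.classifyResults(result, structure)  # e.g. {'Polar': ..., 'Apolar': ...}
    return result.totalArea(), classes["Apolar"]

# Hypothetical inputs: the dimer and each chain saved as separate PDB files.
total_ab, apolar_ab = areas("dimer.pdb")
total_a, apolar_a = areas("chainA.pdb")
total_b, apolar_b = areas("chainB.pdb")

buried_total = total_a + total_b - total_ab      # area buried on dimerisation
buried_apolar = apolar_a + apolar_b - apolar_ab  # hydrophobic contribution
print(f"Buried interface area: {buried_total:.0f} A^2 "
      f"({100.0 * buried_apolar / buried_total:.0f}% apolar)")

PISA reports essentially the same quantities (interface area and its hydrophobic character) without any scripting.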


RFI on access to digital data

From: Gerard DVD Kleywegt
Date: 8 November 2011 22:04


Relevant to the discussion about archiving image data:

         http://federalregister.gov/a/2011-28621

--Gerard

******************************************************************
                          Gerard J. Kleywegt

******************************************************************

----------
From: Peter Keller


Interesting that it quotes MIAME (minimum information about a microarray experiment) as an example of community-driven standardisation. When MIAME was being formulated, it drew in part on, and learned from, the history of the PDB.

It is also a field where (I am told) the raw data really are useless without metadata about the experiment.

Regards,
Peter.

--
Peter Keller                                  
Global Phasing Ltd.,                             


----------
From: Deacon, Ashley M.


I also found this interesting:

http://www.datadryad.org/

Ashley.



----------
From: John R Helliwell


Dear Ashley,
Excellent find!

Within it I see that one of the UK agencies states:
vii. What resources will you require to deliver your (Data management
and sharing) plan?

Greetings,
John
--
Professor John R Helliwell DSc


small molecule crystal data collection

From: Pengfei Fang
Date: 2011/11/9


Dear All,

I have a small molecule single crystal. I want to solve its structure by x-ray diffraction.

Could you please teach me how to collect the diffraction data?

I have some experience with protein crystals, but it's the first time for a small molecule.
I don't know how to set the parameters, like the oscillation angle.
And are there any key points I should pay special attention to?

Thanks in advance!
Pengfei



----------
From: George T. DeTitta

Ah - we've come full circle!



----------
From: Pengfei Fang



Thanks to everyone for trying to help me!

I am going to a synchrotron.
As Bernie said, I will try to use the closest distance and 5-10 deg images.

Thanks again!

Pengfei


On 11/9/11 5:13 PM, "Santarsiero, Bernard D." <bds@uic.edu> wrote:

You can collect 5-10 deg images, and go ahead and collect 360 deg. I usually
use 1-3 second exposures, depending on the diffraction. The detector is
set as close as possible, to get data to 0.8-0.9 A resolution. For a MAR
CCD 300 mm detector, that's around 90-100 mm, with the wavelength set at
0.8 A, or around 15.5 keV in energy.

You can process it with HKL2000 or XDS easily.

Bernie
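
As a quick sanity check of these numbers, the resolution reached at the detector edge follows from tan(2*theta) = r/D and Bragg's law. A short Python sketch (the 150 mm edge radius assumed for a 300 mm MAR CCD is an approximation):

import math

def resolution_at_edge(distance_mm, edge_radius_mm, wavelength_A):
    """d-spacing at the detector edge: tan(2*theta) = r/D, then d = lambda / (2 sin theta)."""
    two_theta = math.atan2(edge_radius_mm, distance_mm)
    return wavelength_A / (2.0 * math.sin(two_theta / 2.0))

for D in (90, 100):  # detector distances in mm
    print(f"{D} mm -> {resolution_at_edge(D, 150.0, 0.8):.2f} A at the edge")
# 90 mm -> ~0.81 A, 100 mm -> ~0.85 A, consistent with the 0.8-0.9 A target.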



On 11/9/11 7:36 PM, "Xiaopeng Hu" wrote:

An X-ray machine designed for protein work is usually not good for small molecules. Just go to a chemistry department/school; they can do it for you in one day.


On 11/9/11 5:15 PM, "Kris Tesh" wrote:

Pengfei,
The first thing to consider is what system you have and whether it is capable of collecting data to high enough resolution. So, what data collection system is available?
Kris
 
Kris F. Tesh, Ph. D.
Department of Biology and Biochemistry
University of Houston
> I don't know how to set the parameters, like the oscillation angle.
--
Bernard D. Santarsiero
Research Professor
Center for Pharmaceutical Biotechnology and the
 Department of Medicinal Chemistry and Pharmacognosy
Center for Structural Biology
Center for Clinical and Translational Science
University of Illinois at Chicago
MC870  3070MBRB  900 South Ashland Avenue
Chicago, IL 60607-7173  USA







O in windows


Hi,

Is anyone using O in a Windows environment with a stereo setup? Could you please tell me how to set it up? I have a Quadro FX 3700 and am able to set up stereo in Coot, PyMOL and VMD under Windows.

Thanks...
Hena

CCP4 Study Weekend 2012


Date: 14 November 2011 17:50



Dear all,

There's just one week remaining to register at the reduced rate for the upcoming CCP4 Study Weekend in January entitled "Data Collection and Processing". The meeting is taking place at the University of Warwick, UK, from the 4th-6th of January 2012. For more details about the meeting including the programme and registration details please see the 2012 Study Weekend website:

http://www.cse.scitech.ac.uk/events/CCP4_2012/

We look forward to seeing you in Warwick.

Katherine, Johan and the CCP4 team

Post-doctoral position available at Center of Excellence Frankfurt


Date: 14 November 2011 18:45


A postdoctoral position in structural biology is available at the Center of Excellence for Macromolecular Complexes in Frankfurt, Germany (CEF, http://www.cef-mc.de). The position will start in February 2012 and is initially available for 1 year, with possibilities for extension.

 

The successful candidate will work on the structural (X-ray crystallographic) and biochemical analysis of a large protein complex. The structural complexity of protein complexes requires a well-trained and highly qualified postdoc. The successful candidate has to be experienced in low-resolution data collection and processing, and in hybrid-method approaches to structure solution, e.g. integrating electron microscopy data.

 

The CEF is located at the FMLS (Frankfurt Center for Molecular Life Sciences, www.fmls-institute.de). The building with its labs, facilities and offices is brand-new and extremely well equipped. The FMLS is part of a large campus which harbors faculties of the Goethe University Frankfurt and Max Planck Institutes covering all disciplines of the life sciences. On this campus, the successful candidate will find an encouraging and inspiring atmosphere.

 

The successful candidate will join a young and ambitious lab which is currently moving from the Max-Planck-Institute of Biochemistry (www.biochem.mpg.de/oesterhelt/grininger) to the CEF, and will grow to 5 to 8 people within the next year.

 

Interested candidates should send a CV and contact information for three references to:
grininge@biochem.mpg.de

 

Martin Grininger

Max-Planck-Institute of Biochemistry

Am Klopferspitz 18

82152 Martinsried/Munich

Germany

 

 


Refmac 5.6 and Twin operator

From: Kiran Kulkarni
Date: 9 November 2011 13:03


Hi All,
 Is it possible to define the twin operator and twin fraction in the latest version of Refmac (i.e., in version 5.6.0117)?
I tried to define these two keywords in Buccaneer, but it fails with the error message "Refmac_5.6.0117:  Problem in some of the instructions".
Please help me to fix this.
Many thanks,
-Kiran

----------
From: Boaz Shaanan


Hi,

Isn't it done automatically by the program (Refmac)?

       Boaz

 
 
Boaz Shaanan, Ph.D.                                        
Dept. of Life Sciences                                     
Ben-Gurion University of the Negev                         
Beer-Sheva 84105                                           
Israel                                                     
                                                           

 
 
                



----------
From: Kiran Kulkarni


Hi,
Thank you for your response.
 Yes, indeed, Refmac (run from Buccaneer) automatically calculates the twin operators and fractions.
But these values are slightly different from those calculated by Xtriage (the twin operators are equivalent, but the twin fractions differ).
The difference shows up in R and Rfree when I refine the structure with Phenix and Refmac: with Phenix the refinement converges, whereas with Refmac it does not.
Furthermore, when I try to build the model using Buccaneer from an existing, partially (nearly 75%) built model, the program distorts the structure and R and Rfree diverge.
Hence, to keep things consistent, I tried to key the twin parameters from Xtriage into Refmac and then refine, which resulted in this problem.
Hope this further clarifies my problem.
 
-Kiran
 
P.S.: At present I am using the Buccaneer model-building option in phenix autobuild, which is progressing well. But just for the sake of comparison I tried to use Buccaneer in CCP4 with the twin refinement option and encountered this problem.


----------
From: Pavel Afonine


Hi Kiran,

phenix.xtriage estimates the twin fraction, while phenix.refine (and I'm pretty sure Refmac too) refines it. That explains the difference you observe.

Pavel
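
For reference, the estimate reported by Xtriage-type analyses typically comes from intensity statistics on twin-related reflection pairs rather than from refinement. A minimal sketch of the Yeates H-test estimator (assuming you have already extracted the pairs of acentric intensities related by the twin law; this is an illustration, not what Xtriage literally runs):

import numpy as np

def h_test_twin_fraction(I1, I2):
    """Yeates H-test: H = |I1 - I2| / (I1 + I2) for twin-related acentric pairs;
    for twin fraction alpha, <H> = 1/2 - alpha, so alpha ~ 1/2 - mean(H)."""
    I1 = np.asarray(I1, dtype=float)
    I2 = np.asarray(I2, dtype=float)
    keep = (I1 + I2) > 0
    H = np.abs(I1[keep] - I2[keep]) / (I1[keep] + I2[keep])
    return 0.5 - H.mean()

A refinement program instead treats the twin fraction as an adjustable parameter, so the estimated and refined values will rarely agree exactly.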




[gsheldr: Re: [ccp4bb] phaser openmp]


Date: 9 November 2011 11:36





---------- Forwarded message ----------
From: "George M. Sheldrick" <gsheldr>
To: Pascal 
Date: Wed, 9 Nov 2011 11:51:40 +0100
Subject: Re: [ccp4bb] phaser openmp
In my experience, writing efficient multithreaded code is much harder
than writing efficient single-thread code, and some algorithms scale
up much better than others. It is important to avoid cache misses, but
because each CPU has its own cache, on rare occasions it is possible
to scale up by more than the number of CPUs, because by dividing up
the memory the number of cache misses can be reduced. In the case of
the multi-CPU version of SHELXD (part of the current beta-test,
available on email request) I was able - with some effort - to keep
the effects of Amdahl's law within limits (on a 32 CPU machine it is
about 29 times faster than with one CPU).

George

On Wed, Nov 09, 2011 at 11:21:11AM +0100, Pascal wrote:
> On Tue, 8 Nov 2011 16:25:22 -0800,
> Nat Echols wrote:
>
> > On Tue, Nov 8, 2011 at 4:22 PM, Francois Berenger 
> > wrote:
> > > In the past I have been quite badly surprised by
> > > the no-acceleration I gained when using OpenMP
> > > with some of my programs... :(
>
> You need big parallel jobs and to avoid synchronisations, barriers and that
> kind of thing. Using data reduction is much more efficient. It works
> very well for structure factor calculations, for example.
>
> >
> > Amdahl's law is cruel:
> >
> > http://en.wikipedia.org/wiki/Amdahl's_law
>
> You can have much less than 5% of serial code.
>
> I have more problems with L2 cache miss events and memory bandwidth. A
> quad core means 4 times the bandwidth needed for a single process...
> If your code is already a bit greedy, the scale-up is not good.
>
> Pascal
>

--
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,
D37077 Goettingen, Germany




Joint Ph. D position, NCSU & ORNL


Date: 28 November 2011 18:28


North Carolina State University (NCSU) and Oak Ridge National Laboratory (ORNL)

 

Applications are invited for a Ph. D fellowship in X-ray and neutron macromolecular crystallography in the laboratory of Dr. Flora Meilleur in the Molecular and Structural Biochemistry Department at NCSU and the Neutron Sciences Directorate at ORNL.

 

Research will focus on the elucidation of enzymatic mechanisms using a combination of high resolution X-ray and neutron crystallography. Interested candidates should hold a BS or MS degree and have a strong background in biochemistry, biophysics or a related discipline. Previous experience in molecular biology and protein purification is a plus.

 

The successful candidate will complete the graduate training program at North Carolina State University, in Raleigh, NC during their first year and will be located at Oak Ridge National Laboratory for the remainder of their doctoral research.  

 

This fellowship is supported in part by an NSF IGERT (Integrative Graduate Education and Research Traineeship Program) award. The application requires US residency.

 

Interested candidates should submit their application at http://www.ncsu.edu/grad/applygrad.htm.

 

For more information on the Molecular and Structural Biochemistry department and the graduate program at NCSU, please visit: http://biochem.ncsu.edu/. For more information on the Neutron Sciences directorate, please visit: http://neutrons.ornl.gov/.

 

Please contact Flora Meilleur  with any questions regarding the position.

 

Flora Meilleur

Assistant Professor

Molecular & Structural Biochemistry

N C State University

& Neutron Sciences Directorate

Oak Ridge National Laboratory


 

Beamline Scientist position at the EMBL-Grenoble


Date: 28 November 2011 17:06


 
Dear all,
 
I would like to inform you that a beamline scientist position at the
EMBL-Grenoble has opened (the application deadline is 31
December).
 
The successful candidate will join the Synchrotron Crystallography
Team at the EMBL-Grenoble and work as part of the collaborative
EMBL/ESRF Joint Structural Biology Group. He/She is expected
to take a significant role in the commissioning and operation of the
new ID30A MASSIF beamline complex, which is part of the ESRF
UPBL10 Upgrade project. The candidate will also contribute to the
technical and scientific development of advanced methods for crystal
screening and associated data analysis, vital to complement the next
generation of beamline automation that will be installed on MASSIF.
Further details on these projects can be found at:
 
Further information on the position and group can be found here:
 
For informal discussions please contact me.
 
Best regards,
 
Andrew
 


Job opening for a structural biologist to join the “Biocrystallography and Structural Biology of Therapeutic Targets” group at the Institute for the Biology and Chemistry of Proteins, Lyon in France


Date: 29 November 2011 11:01


A 2-year postdoctoral position with support from the French National Research Agency (ANR) is available to study regulatory mechanisms of BMP-1/tolloid-like proteinases (BTPs; also known as procollagen C-proteinases), enzymes implicated in a number of key events during tissue morphogenesis and tissue repair. The proposed research project aims at a better understanding of the mechanism by which specific enhancer proteins regulate BTP activity, by solving the 3D structure of the substrate:enhancer complex.
 Our studies involve a multidisciplinary approach, which combines molecular- and structural biology along with classical biochemical procedures.
 A Ph.D. in biochemistry or biophysics, and a solid experience in crystallization and X-ray structure determination are required.
 Background knowledge in protein expression and purification is highly desirable.

Salary in the range 30-35 k€, depending on experience.

The position is immediately available (application deadline December 15, 2011) and interested candidates should send their CV, a letter of interest and contact information of 3 references to Dr. Nushin Aghajari; n.aghajari@ibcp.fr and Dr. David Hulmes; d.hulmes@ibcp.fr.

Postdoctoral position, Institut de Biologie Stucturale, Grenoble, France


Date: 29 November 2011 10:27

Postdoctoral position, Institut de Biologie Stucturale, Grenoble, France

A two year postdoctoral position for a biochemist/structural biologist is available in the Synchrotron Group at the Institute for Structural Biology (IBS) in Grenoble (France).
The IBS has state-of-the-art equipment and facilities, including crystallization robots, NMR, EM and easy access to synchrotron beamlines. The subject is focused on the structural analysis of a CDK/Cyclin complex in a drug-design approach. Candidates should hold a Ph.D. in biochemistry or biophysics. Solid experience in protein expression, purification and biochemical characterization is required. Knowledge and experience in protein crystallography and in in silico docking will be an advantage. Candidates must be motivated, well organized and able to work independently as well as part of a team.
Salary is in the range 30-35 k€, depending on experience. The position is immediately available and interested candidates should send their CV to franck.borel@ibs.fr



Postdoc and PhD student positions at the Biozentrum (Basel, Switzerland)


Date: 29 November 2011 10:24


Postdoctoral and PhD student positions are available at the Biozentrum,
University of Basel, Switzerland, in the group of Timm Maier to study
eukaryotic lipid metabolism and its regulation through structural and
functional work on very large multifunctional enzymes and membrane
proteins (cf. Science 321:1325; Nature 459:726; Q Rev Biophys 43:373).

We are currently expanding our research team and are looking for
enthusiastic scientists, who share our dedication to solve challenging
problems in biology (see www.biozentrum.unibas.ch). Our approach builds on
X-ray crystallography as a key method to obtain insights at atomic
resolution in combination with strategies for the stabilization, labeling
and trapping of macromolecules. Biophysical characterization of
macromolecular interactions and functional analysis complement our
structural studies. The core program "Structural Biology and Biophysics"
of the Biozentrum offers a stimulating interdisciplinary environment and
state-of-the-art facilities for all aspects of modern structural biology.

Postdoc candidates should have recently obtained their PhD with practical
experience in crystallographic structure determination or production of
difficult protein targets and are expected to have demonstrated their
scientific excellence by at least one publication as first author. The
Biozentrum is dedicated to excellence in research and education. It
provides outstanding support and training for PhD students
(www.biozentrum.unibas.ch/education). PhD candidates should have a recent
master's degree in biological sciences including practical lab work in
molecular biology or protein production. Prior experience in structural
biology is an advantage but not required. Basel is a global centre for
biological research, with major academic institutions and research
departments of leading life science companies. The Biozentrum is an
internationally renowned institute for basic life science research. With
over 500 staff members from 40 countries, it is the largest natural
science department of the University of Basel.

Please send your application (electronically as a single pdf document)
including CV, publication record and a concise outline of prior research
experience together with addresses of up to three references to
timm.maier@unibas.ch. The University of Basel is an equal opportunity
employer and encourages applications from female candidates. Informal
inquiries via e-mail are welcome.

Leader of Biophysics Facility at the Biozentrum Basel


Date: 29 November 2011 10:22


The Biozentrum of the University of Basel, an internationally renowned
Life Sciences research institute, invites applications for the Leader of
its new Biophysics Facility. We are looking for an enthusiastic person
that combines technical expertise with communication skills. The ideal
candidate is a PhD level scientist with practical experience and excellent
theoretical knowledge in biophysical methods for the characterization of
biomacromolecules and their interactions, including optical spectroscopy
(fluorescence, circular dichroism), microcalorimetry, analytical
ultracentrifugation and surface plasmon resonance. He/she enjoys providing
excellent user training and support for measurements and data evaluation,
maintains and develops an assembly of state-of-the art instruments,
contributes to practical and theoretical training for students and
promotes scientific collaboration in the Biozentrum.

The Biozentrum offers an outstanding scientific environment and a
competitive salary, while Basel provides a high standard of living and a
superb cultural atmosphere. Applications including CV, list of
publications and a short summary of past experience, have to be
transmitted in electronic form to Zs-Biozentrum@unibas.ch.

This position is open until filled; we aim for a starting date in 2012.
For informal enquiries please contact Prof. S. Grzesiek
(stephan.grzesiek-at-unibas.ch,  phone: +41 (0) 61 267 21 00). The
University of Basel is an equal opportunity employer and encourages
applications from female candidates.

Monday 28 November 2011

phaser openmp

From: Ed Pozharski
Date: 8 November 2011 17:29


Could anyone point me towards instructions on how to get/build a
parallelized phaser binary on Linux?  I searched around but so far have found
nothing.  The latest updated phaser binary doesn't seem to be
parallelized.

Apologies if this has been resolved before - just point at the relevant
thread, please.

--
"I'd jump in myself, if I weren't so good at whistling."
                              Julian, King of Lemurs

----------
From: Dr G. Bunkoczi


Hi Ed,

in the CCP4 distribution, openmp is not enabled by default, and there
seems to be no easy way to enable it (i.e. by setting a flag at the
configure stage).

On the other hand, you can easily create a separate build for phaser
that is openmp enabled and use phaser from there. To do this, create a
new folder, say "phaser-build", cd into it, and issue the following
commands (this assumes you are using bash):

$ python $CCP4/lib/cctbx/cctbx_sources/cctbx_project/libtbx/configure.py \
    --repository=$CCP4/src/phaser/source phaser \
    --build-boost-python-extensions=False --enable-openmp-if-possible=True

$ . ./setpaths.sh      ("source ./setpaths.csh" with csh)

$ libtbx.scons         (if you have several CPUs, add -jX, where X is the number of CPUs you want to use for compilation)

This will build phaser that is openmp-enabled. You can also try passing
the --static-exe flag (to configure.py), in which case the executable is
static and can be relocated without any headaches. This works with
certain compilers.

Let me know if there are any problems!

BW, Gabor

----------
From: Francois Berenger


Hello,

How much faster is the OpenMP version of Phaser
as a function of the number of cores used?

In the past I have been quite badly surprised by
the lack of acceleration I got when using OpenMP
with some of my programs... :(

Regards,
F.

----------
From: Nat Echols


Amdahl's law is cruel:

http://en.wikipedia.org/wiki/Amdahl's_law

This is the same reason why GPU acceleration isn't very useful for
most crystallography software.

-Nat
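
To put numbers on this, Amdahl's law gives speedup = 1 / ((1 - p) + p/N) for a parallel fraction p on N cores. A tiny Python sketch (the serial fractions used are illustrative):

def amdahl_speedup(parallel_fraction, n_cpus):
    """Overall speedup when a fraction p of the runtime is parallelised perfectly over n CPUs."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_cpus)

print(round(amdahl_speedup(0.95, 32), 1))    # 5% serial code caps 32 CPUs at ~12.5x
print(round(amdahl_speedup(0.9967, 32), 1))  # ~29x on 32 CPUs, as quoted for SHELXD
                                             # elsewhere in this thread, implies only
                                             # ~0.3% serial time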

----------
From: Ed Pozharski


See page 3 of this

http://www-structmed.cimr.cam.ac.uk/phaser/ccp4-sw2011.pdf

----------
From: Randy Read


Thanks for pointing out that link.  The graph makes the point I was going to mention, i.e. that you notice a big difference in using up to about 4 processors for typical jobs, but after that point the non-parallelisable parts of the code start to dominate and there's less improvement.  This is very useful if you have one MR job to run on a typical modern workstation (2-8 cores), but if you have several separate jobs to run then you're better off submitting them simultaneously, each using a subset of the available cores.  Of course, that assumes you have enough memory for several simultaneous separate jobs!

Regards,

Randy Read
------
Randy J. Read
Department of Haematology, University of Cambridge


----------
From: Pascal

On Tue, 8 Nov 2011 16:25:22 -0800,
Nat Echols wrote:
You need big parallel jobs and to avoid synchronisations, barriers and that
kind of thing. Using data reduction is much more efficient. It works
very well for structure factor calculations, for example.
You can have much less than 5% of serial code.

I have more problems with L2 cache miss events and memory bandwidth. A
quad core means 4 times the bandwidth needed for a single process...
If your code is already a bit greedy, the scale-up is not good.

Pascal

----------
From: Francois Berenger


I never went down to this level of optimization.
Are you using valgrind to detect cache miss events?

After gprof, usually I am done with optimization.
I would prefer to change my algorithm and would be afraid
of introducing optimizations that are architecture-dependent
into my software.

Regards,
F.

----------
From: Pascal


No, I am not sure valgrind can cope with multithreaded applications correctly.
In this particular case my code runs faster on an Intel Q9400@2.67GHz with
800MHz DDR2 than on an Intel Q9505@2.83GHz with 667MHz DDR2. Also I have
a nice scale-up on a 4x12-core Opteron machine (each CPU has 2 dual-channel memory buses)
but not on my standard quad core. If I get my hands on an i7-920 equipped with
triple-channel DDR3, the program should run much faster despite the same CPU clock.

Then I used perf[1] and oprofile[2] on Linux.
Have a look here for the whole story:
<http://blog.debroglie.net/2011/10/25/cpu-starvation/>

When I spot a bottleneck, my first reaction is also to change the algorithm:
caching calculations, more efficient algorithms...

But once I had to do some manual loop tiling. It's kind of a change of algorithm,
as the size of a temporary variable changes as well, but the number of operations
remains the same. The code with the loop tiling is ~20% faster, purely due to
better use of the CPU cache.
<http://blog.debroglie.net/2011/10/28/loop-tiling/>

[1] http://kernel.org/ package name should be perf-util or similar
[2] http://oprofile.sourceforge.net/

Pascal


Sunday 27 November 2011

image compression

From: James Holton
Date: 7 November 2011 17:30



At the risk of sounding like another "poll", I have a pragmatic question for the methods development community:

Hypothetically, assume that there was a website where you could download the original diffraction images corresponding to any given PDB file, including "early" datasets that were from the same project, but because of smeary spots or whatever, couldn't be solved.  There might even be datasets with "unknown" PDB IDs because that particular project never did work out, or because the relevant protein sequence has been lost.  Remember, few of these datasets will be less than 5 years old if we try to allow enough time for the original data collector to either solve it or graduate (and then cease to care).  Even for the "final" dataset, there will be a delay, since the half-life between data collection and coordinate deposition in the PDB is still ~20 months.  Plenty of time to forget.  So, although the images were archived (probably named "test" and in a directory called "john") it may be that the only way to figure out which PDB ID is the "right answer" is by processing them and comparing to all deposited Fs.  Assume this was done.  But there will always be some datasets that don't match any PDB.  Are those interesting?  What about ones that can't be processed?  What about ones that can't even be indexed?  There may be a lot of those!  (hypothetically, of course).

Anyway, assume that someone did go through all the trouble to make these datasets "available" for download, just in case they are interesting, and annotated them as much as possible.  There will be about 20 datasets for any given PDB ID.

Now assume that for each of these datasets this hypothetical website has two links, one for the "raw data", which will average ~2 GB per wedge (after gzip compression, taking at least ~45 min to download), and a second link for a "lossy compressed" version, which is only ~100 MB/wedge (2 min download).  When decompressed, the images will visually look pretty much like the originals, and generally give you very similar Rmerge, Rcryst, Rfree, I/sigma, anomalous differences, and all other statistics when processed with contemporary software.  Perhaps a bit worse.  Essentially, lossy compression is equivalent to adding noise to the images.

Which one would you try first?  Does lossy compression make it easier to hunt for "interesting" datasets?  Or is it just too repugnant to have "modified" the data in any way shape or form ... after the detector manufacturer's software has "corrected" it?  Would it suffice to simply supply a couple of "example" images for download instead?

-James Holton
MAD Scientist

----------
From: Herbert J. Bernstein


This is a very good question.  I would suggest that both versions
of the old data are useful.  If what is being done is simple validation
and regeneration of what was done before, then the lossy compression
should be fine in most instances.  However, when what is being
done hinges on the really fine details -- looking for lost faint
spots just peeking out from the background, looking at detailed
peak profiles -- then the losslessly compressed version is the
better choice.  The annotation for both sets should be the same.
The difference is in storage and network bandwidth.

Hopefully the fraud issue will never again rear its ugly head,
but if it should, then having saved the losslessly compressed
images might prove to have been a good idea.

To facilitate experimentation with the idea, if there is agreement
on the particular lossy compression to be used, I would be happy
to add it as an option in CBFlib.  Right now all the compressions
we have are lossless.

Regards,
 Herbert


=====================================================
 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769
=====================================================

----------
From: James Holton


So far, all I really have is a "proof of concept" compression algorithm here:
http://bl831.als.lbl.gov/~jamesh/lossy_compression/

Not exactly "portable" since you need ffmpeg and the x264 libraries
set up properly.  The latter seems to be constantly changing things
and breaking the former, so I'm not sure how "future proof" my
"algorithm" is.

Something that caught my eye recently was fractal compression,
particularly since FIASCO has been part of the NetPBM package for
about 10 years now.  Seems to give comparable compression vs quality
as x264 (to my eye), but I'm presently wondering if I'd be wasting my
time developing this further?  Will the crystallographic world simply
turn up its collective nose at lossy images?  Even if it means waiting
6 years for "Nielsen's Law" to make up the difference in network
bandwidth?

-James Holton
MAD Scientist

----------
From: Herbert J. Bernstein


Dear James,

 You are _not_ wasting your time.  Even if the lossy compression ends
up only being used to stage preliminary images forward on the net while
full images slowly work their way forward, having such a compression
that preserves the crystallography in the image will be an important
contribution to efficient workflows.  Personally I suspect that
such images will have more important uses, e.g. facilitating
real-time monitoring of experiments using detectors providing
full images at data rates that simply cannot be handled without
major compression.  We are already in that world.  The reason that
the Dectris images use Andy Hammersley's byte-offset compression,
rather than going uncompressed or using CCP4 compression is that
in January 2007 we were sitting right on the edge of a nasty CPU-performance/disk bandwidth tradeoff, and the byte-offset
compression won the competition.   In that round a lossless
compression was sufficient, but just barely.  In the future,
I am certain some amount of lossy compression will be
needed to sample the dataflow while the losslessly compressed
images work their way through a very back-logged queue to the disk.

 In the longer term, I can see people working with lossy compressed
images for analysis of massive volumes of images to select the
1% to 10% that will be useful in a final analysis, and may need
to be used in a lossless mode.  If you can reject 90% of the images
with a fraction of the effort needed to work with the resulting
10% of good images, you have made a good decision.

 And then there is the inevitable need to work with images on
portable devices with limited storage over cell and WIFI networks. ...

 I would not worry about upturned noses.  I would worry about
the engineering needed to manage experiments.  Lossy compression
can be an important part of that engineering.

 Regards,
   Herbert
--
    Dowling College, Brookhaven Campus, B111B
  1300 William Floyd Parkway, Shirley, NY, 11967
=====================================================

----------
From: Frank von Delft


I'll second that...  can't remember anybody on the barricades about "corrected" CCD images, but they've been just so much more practical.

Different kind of problem, I know, but equivalent situation:  the people to ask are not the purists, but the ones struggling with the huge volumes of data.  I'll take the lossy version any day if it speeds up real-time evaluation of data quality, helps me browse my datasets, and allows me to do remote but intelligent data collection.

phx.

----------
From: Miguel Ortiz Lombardia


So the purists of speed seem to be more relevant than the purists of images.

We complain all the time about how many errors we have out there in our
experiments that we seemingly cannot account for. Yet, would we add
another source?

Sorry if I'm missing something serious here, but I cannot understand
this artificial debate. You can do useful remote data collection without
having to look at *each* image.


Miguel


--
Miguel


----------
From: Jan Dohnalek


I think that real universal image deposition will not take off without a newish type of compression that will speed things up and ease them.
Therefore the compression discussion is highly relevant - I would even suggest going to mathematicians and software engineers to provide
a highly efficient compression format for our type of data - our data sets have some very typical repetitive features, so they can very likely be compressed as a whole set without losing information (differential compression within the series), but this needs experts...


Jan Dohnalek
--
Jan Dohnalek, Ph.D
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
16206 Praha 6
Czech Republic


----------
From: Graeme Winter

HI James,

Regarding the suggestion of lossy compression, it is really hard to
comment without having a good idea of the real cost of doing this. So,
I have a suggestion:

 - grab a bag of JCSG data sets, which we know should all be essentially OK.
 - you squash then unsquash them with your macguffin, perhaps
randomizing them as to whether A or B is squashed.
 - process them with Elves / xia2 / autoPROC (something which is reproducible)
 - pop the results into pdb_redo

Then compare what comes out. Ultimately adding "noise" may (or may
not) make a measurable difference to the final refinement - this may
be a way of telling whether it does or doesn't. Why, however, would I have
any reason to worry? Because the noise being added is not really
random - it will be compression artifacts. This could have a subtle
effect on how the errors are estimated and so on. However, you can hum
and haw about this for a decade without reaching a conclusion.

Here, it's something which in all honesty we can actually evaluate, so
is it worth giving it a go? If the results were / are persuasive (i.e.
a "report on the use of lossy compression in transmission and storage
of X-ray diffraction data" was actually read and endorsed by the
community) this would make it much more worthwhile for consideration
for inclusion in e.g. cbflib.

I would however always encourage (if possible) that the original raw
data is kept somewhere on disk in an unmodified form - I am not a fan
of one-way computational processes with unique data.

Thoughts anyone?

Cheerio,

Graeme

----------
From: Kay Diederichs


Hi James,

I see no real need for lossy compression datasets. They may be useful for demonstration purposes, and to follow synchrotron data collection remotely. But for processing I need the real data. It is my experience that structure solution, at least in the difficult cases, depends on squeezing out every bit of scattering information from the data, as much as is possible with the given software. Using a lossy-compression dataset in this situation would give me the feeling "if structure solution does not work out, I'll have to re-do everything with the original data" - and that would be double work. Better not start going down that route.

The CBF byte compression puts even a 20bit detector pixel into a single byte, on average. These frames can be further compressed, in the case of Pilatus fine-slicing frames, using bzip2, almost down to the level of entropy in the data (since there are so many zero pixels). And that would be lossless.

Storing lossily-compressed datasets would of course not double the diskspace needed, but would significantly raise the administrative burdens.

Just to point out my standpoint in this whole discussion about storage of raw data:
I've been storing our synchrotron datasets on disks, since 1999. The amount of money we spend per year for this purpose is constant (less than 1000€). This is possible because the price of a GB disk space drops faster than the amount of data per synchrotron trip rises. So if the current storage is full (about every 3 years), we set up a bigger RAID (plus a backup RAID); the old data, after copying over, always consumes only a fraction of the space on the new RAID.

So I think the storage cost is actually not the real issue - rather, the real issue has a strong psychological component. People a) may not realize that the software they use is constantly being improved, and that needs data which cover all the corner cases; b) often do not wish to give away something because they feel it might help their competitors, or expose their faults.

best,

Kay (XDS co-developer)
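
For readers unfamiliar with it, the byte-offset idea exploits the fact that neighbouring pixels differ by little, so most pixel-to-pixel differences fit in a single signed byte. A simplified Python sketch of that delta-encoding idea (illustration only, not the actual CBF byte-offset specification):

import numpy as np

def delta_encode(pixels):
    """Store each pixel as its difference from the previous one; small deltas
    take one signed byte, larger ones are escaped to a 32-bit integer."""
    out = bytearray()
    prev = 0
    for value in np.asarray(pixels, dtype=np.int64).ravel():
        delta = int(value) - prev
        if -127 <= delta <= 127:
            out += delta.to_bytes(1, "little", signed=True)
        else:
            out += (-128).to_bytes(1, "little", signed=True)  # escape marker
            out += delta.to_bytes(4, "little", signed=True)   # fall back to 32 bits
        prev = int(value)
    return bytes(out)

# A smoothly varying background frame: nearly every delta fits in one byte.
frame = (1000 + 5 * np.random.randn(100, 100)).astype(np.int32)
print(len(delta_encode(frame)), "bytes for", frame.size, "pixels")

Lossless general-purpose compressors such as bzip2 can then squeeze out whatever regularity remains, as Kay describes for fine-sliced Pilatus frames.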




----------
From: Harry Powell


Hi
I agree.

Harry
--
Dr Harry Powell, MRC Laboratory of Molecular Biology, MRC Centre, Hills Road, Cambridge, CB2 0QH

http://www.iucr.org/resources/commissions/crystallographic-computing/schools/mieres2011

----------
From: Miguel Ortiz LombardĂ­a


On 08/11/11 10:15, Kay Diederichs wrote:
Hi Kay and others,

I completely agree with you.

Datalove, <3
:-)

----------
From: Herbert J. Bernstein


Um, but isn't crystallography based on a series of
one-way computational processes:
    photons -> images
    images -> {structure factors, symmetry}
 {structure factors, symmetry, chemistry} -> solution
 {structure factors, symmetry, chemistry, solution}
     -> refined solution

At each stage we tolerate a certain amount of noise
in "going backwards".  Certainly it is desirable to
have the "original data" to be able to go forwards,
but until the arrival of pixel array detectors, we
were very far from having the true original data,
and even pixel array detectors don't capture every
single photon.

I am not recommending lossy compressed images as
a perfect replacement for lossless compressed images,
any more than I would recommend structure factors
are a replacement for images.  It would be nice
if we all had large budgets, huge storage capacity
and high network speeds and if somebody would repeal
the speed of light and other physical constraints, so that
engineering compromises were never necessary, but as
James has noted, accepting such engineering compromises
has been of great value to our colleagues who work
with the massive image streams of the entertainment
industry.  Without lossy compression, we would not
have the _higher_ image quality we now enjoy in the
less-than-perfectly-faithful HDTV world that has replaced
the highly faithful, but lower capacity, NTSC/PAL world.

Please, in this, let us not allow the perfect to be
the enemy of the good.  James is proposing something
good.

=====================================================
  Herbert J. Bernstein
  Professor of Mathematics and Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769
=====================================================


----------
From: Graeme Winter


Dear Herbert,

Sorry, the point I was getting at was that the process is one way, but
if it is also *destructive*, i.e. the original "master" is not
available, then I would not be happy. If the master copy of what was
actually recorded is available from a tape someplace, perhaps not all
that quickly, then to my mind that's fine.

When we go from images to intensities, the images still exist. And by
and large the intensities are useful enough that you don't go back to
the images again. This is worth investigating I believe, which is why
I made that proposal.

Mostly I listen to MP3s as they're convenient, but I still buy CDs
rather than buying directly off e.g. iTunes, and yes, an H.264-compressed video stream is
much nicer to watch than VHS.

Best wishes,

Graeme


----------
From: James Holton

At the risk of putting this thread back on-topic, my original question was not "should I just lossily compress my images and throw away the originals".  My question was:

 "would you download the compressed images first?"

So far, no one has really answered it.

I think it is obvious that of course we would RATHER have the original data, but if access to the original data is "slow" (by a factor of 30 at best) then can the "mp3 version" of diffraction data play a useful role in YOUR work?

Taking Graeme's request from a different thread as an example, he would like to see stuff in P21 with a 90 degree beta angle.  There are currently ~609 examples of this in the PDB.  So, I ask again: "which one would you download first?".  1aip? (It is first alphabetically).  Then again, if you just email the corresponding authors of all 609 papers, the response rate alone might whittle the number of datasets to deal with down to less than 10.  Perhaps even less than 1.

-James Holton
MAD Scientist

----------
From: Graeme Winter


Hi James,

Fair enough.

However I would still be quite interested to see how different the
results are from the originals and the compressed versions. If the
differences were pretty minor (i.e. not really noticeable) then I
would certainly have a good look at the mp3 version.

Also it would make my data storage situation a little easier, at least
what I use for routine testing. Worth a go?

Cheerio,

Graeme

----------
From: Miguel Ortiz Lombardia


On 08/11/2011 19:19, James Holton wrote:
Hmm, I thought I had been clear. I will try to be more direct:

Given the option, I would *only* download the original,
non-lossy-compressed data. At the expense of time, yes. I don't think
Graeme's example is very representative of our work, sorry.

As long as the option between the two is warranted, I don't care. I just
don't see the point for the very same reasons Kay has very clearly exposed.

Best regards,

----------
From: <mjvdwoerd


Hmmm, so you would, when collecting large data images, say 4 images of 100 MB each per second, in the middle of the night, from home, reject seeing compressed images in your data collection software, while the "real thing" is lingering behind somewhere, to be downloaded and stored later? As opposed to not seeing the images (because your home internet access cannot keep up) and only inspecting 1 in 100 images to see progress?

I think there are instances where compressed (lossy or not) images will be invaluable. I know the above situation was not the context, but (y'all may gasp about this) I still have some friends (in the US) who live so far out in the wilderness that only dial-up internet is available. That while synchrotrons and the detectors used get better all the time, which means more MB/s produced.

James has already said (and I agree) that the original images (with all information) should not necessarily be thrown away. Perhaps a better question would be "which would you use for what purpose", since I am convinced that compressed images are useful.

I would want to process the "real thing", unless I have been shown by scientific evidence that the compressed thing works equally well. It seems reasonable to assume that such evidence can be acquired and/or that we can be shown by evidence what we gain and lose by lossy-compressed images. Key might be to be able to choose the best thing for your particular application/case/location etc.

So yes, James, of course this is useful and not a waste of time.

Mark

----------
From: Miguel Ortiz Lombardia


On 08/11/2011 20:46, mjvdwoerd@netscape.net wrote:
1. I don't need to *see* all images to verify whether the collection is
going all right. If I collect remotely, I process remotely; no need to
transfer images. Data is collected so fast today that you may, even
while collecting at the synchrotron, finish the collection without a)
actually seeing all the images (cf. Pilatus detectors) or b) keeping
pace with your data processing. The crystal died or was not collected
properly? You try to understand why, and you recollect it if possible or you
try a new crystal. It's always been like this; it's called trial and error.

2. The ESRF in Grenoble produces thumbnails of the images. If all you
want to see is whether there is diffraction, they are good enough and
they are useful. They are extremely lossy and useless for anything else.

3. Please, compare contemporary facts. Today's bandwidth is what it is, and
today's images are *not* 100 MB (yet). When they get there, let us know
what the bandwidth is.
I would understand a situation like the one you describe for a poor, or
an embargoed country where unfortunately there is no other way to
connect to a synchrotron. Still, that should be solved by the community
in a different way: by gracious cooperation with our colleagues in those
countries. Your example is actually quite upsetting, given the current
state of affairs in the world.
I think I was clear: as long as we have access to the original data, I
don't care. I would only use the original data.
This still assumes that future software will not be able to detect the
differences that you cannot see today. This may or may not be true, the
consequences may or may not be important. But there is, I think,
reasonable doubt on both questions.
I have said to James, off the list, that he should go on if he's
convinced of the usefulness of his approach. For a very scientific
reason: I could be wrong. Yet, if we do need to go down the compression
path, I think we should prefer lossless options.

----------
From: Phil Evans


It would be a good start to get all images written now with lossless compression, instead of the uncompressed images we still get from the ADSC detectors - something we've been promised for many years.

Phil

----------
From: Herbert J. Bernstein



ADSC has been a leader in supporting compressed CBF's.

=====================================================
             Herbert J. Bernstein
   Professor of Mathematics and Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769
=====================================================


----------
From: William G. Scott


The mp3/music analogy might be quite appropriate.

On some commercial music download sites, there are several options for purchase, ranging from audiophool-grade 24-bit, 192kHz sampled music, to CD-quality (16-bit, 44.1kHz), to mp3 compression and various lossy bit-rates.  I am told that the resampling and compression is actually done on the fly by the server, from a single master, and the purchaser chooses what files to download based on cost, ability to play high-res data, degree of canine-like hearing, intolerance for lossy compression with its limited dynamic range, etc.

Perhaps that would be the best way to handle it from a central repository, allowing the end-user to decide on the fly. The lossless files could somehow be tagged as such, to avoid confusion.


Bill




William G. Scott
Professor
Department of Chemistry and Biochemistry
USA



Saturday 26 November 2011

Archiving Images for PDB Depositions

From: Jacob Keller
Date: 31 October 2011 16:02


Dear Crystallographers,

I am sending this to try to start a thread which addresses only the
specific issue of whether to archive, at least as a start, images
corresponding to PDB-deposited structures. I believe there could be a
real consensus about the low cost and usefulness of this degree of
archiving, but the discussion keeps swinging around to all levels of
archiving, obfuscating who's for what and for what reason. What about
this level, alone? All of the accompanying info is already entered
into the PDB, so there would be no additional costs on that score.
There could just be a simple link, added to the "download files"
pulldown, which could say "go to image archive," or something along
those lines. Images would be pre-zipped, maybe even tarred, and people
could just download from there. What's so bad?

The benefits are that sometimes there are structures in which
resolution cutoffs might be unreasonable, or perhaps there is some
potential radiation damage in the later frames that might be
deleterious to interpretations, or perhaps there are ugly features in
the images which are invisible or obscure in the statistics.

In any case, it seems to me that this step would be pretty painless,
as it is merely an extension of the current system--just add a link to
the pulldown menu!

Best Regards,

Jacob Keller

--
*******************************************
Jacob Pearson Keller
Northwestern University
Medical Scientist Training Program
*******************************************

----------
From: Adrian Goldman


I have no problem with this idea as an opt-in. However I loathe being forced to do things - for my own good or anyone else's. But unless I read the tenor of this discussion completely wrongly, opt-in is precisely what is not being proposed.

Adrian Goldman


----------
From: Jacob Keller


Pilot phase, opt-in--eventually, mandatory? Like structure factors?

Jacob

----------
From: Frank von Delft


"Loathe being forced to do things"?  You mean, like being forced to use programs developed by others at no cost to yourself?

I'm in a bit of a time-warp here - how exactly do users think our current suite of software got to be as astonishingly good as it is?  10 years ago people (non-developers) were saying exactly the same things - yet almost every talk on phasing and auto-building that I've heard ends up acknowledging the JCSG datasets.

Must have been a waste of time then, I suppose.

phx.

----------
From: Clemens Vonrhein


Dear Adrian,
I understood it slightly differently - see Gerard Bricogne's points in

 https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1110&L=CCP4BB&F=&S=&P=363135

which sounds very much like an opt-in. Such a starting point sounds
very similar to what we had with initial PDB submission (optional for
publication) and then structure factor deposition.

Cheers

Clemens

--

***************************************************************
* Clemens Vonrhein, Ph.D.     vonrhein AT GlobalPhasing DOT com
*
*  Global Phasing Ltd.
*  Sheraton House, Castle Park
*  Cambridge CB3 0AX, UK
*--------------------------------------------------------------
* BUSTER Development Group      (http://www.globalphasing.com)
***************************************************************

----------
From: Anastassis Perrakis

Dear all,

Apologies for a lengthy email in a lengthy chain of emails.

I think Jacob did a good job here of refocusing the question. I will try to answer it in a rather simplistic manner,
but from the viewpoint of somebody who may have spent only relatively little time in the field, yet has enjoyed the
privilege of seeing it both from the developer and from the user perspective, and from environments
as different as service-oriented synchrotron sites and a cancer hospital. I will only claim my weight=1, obviously,
but I want to emphasize that where you stand influences your perspective.

Let me first present the background that shapes my views.

<you can skip this>

When we started with ARP/wARP (two decades ago for Victor, and getting pretty close to that for myself!), we (like others) hardly
had the benefit of large datasets. We had some friends who gladly donated their data for us to play with,
and we assembled enough data to aid our primitive efforts back then. The same holds true for many others.

At some point, around 2002, we started XtalDepot with Serge Cohen: the idea was to systematically collect phased data,
moving one step beyond HKL F/SigF to also include either the Hendrickson-Lattman coefficients (HLA/B/C/D) or the search model for the molecular replacement solution.
Despite several calls, that archive acquired only around a hundred structures, and yesterday morning it was taken off-line,
as it was no longer useful and no longer visited by anyone. Very likely our effort was redundant because of the JCSG
dataset, which has been used by many, many people who are grateful for it (I guess the 'almost' in Frank's 'almost every talk' refers to me;
I have never used the JCSG set).

Lately I am involved in the PDB_REDO project, which was pioneered by Gert Vriend and Robbie Joosten (who is now in my lab).
Thanks to Gerard K.'s EDS clean-up and the subsequent efforts of both Robbie and Garib, who made gazillions of fixes to REFMAC,
we can now not only make maps of PDB entries but also refine them - all but fewer than 100 structures. That has cost a significant part of
the last four to five years of Robbie's life (and has received limited appreciation from editors of 'important' journals and from referees of our grants).

</you can skip this>

These experiences are what shape my view, and my train of thought goes like this:

The PDB collected F/sigF, and the ability to really use them - to get maps first, to re-refine later, and to re-build now - has received rather
limited attention. It is starting to have an impact on some fields, mostly on modelling efforts, and unlike referee nr. 3 I strongly believe it
has great potential for impact.

My team also collected phases, as did the JCSG on a more successful and consistent scale,
and that effort has indeed been used by developers to deliver better benchmarking
of much software (to my knowledge nobody has used the JCSG data directly, e.g. for learning techniques,
but I apologize if I have missed that). This benchmarking of software, based on 'real' maps for a rather limited set of data -
hundreds, not tens of thousands - was important enough anyway.

That leads me to conclude that archiving images is a good idea on a voluntary basis. Somebody who needs it should convince the funding bodies
to make the money available, and then make the effort to provide the infrastructure. I would predict that 100-200 datasets would then be collected,
and that would really, really help developers to create the important new algorithms and software we all need. That's a modest investment
that can teach us a lot. One of the SG groups could make this effort, and most of us would support it, myself included.

Would such data help more than the developers? I doubt it. Is it important to make such a resource available to developers? Absolutely?
What is the size of the resource needed? Limited: a few hundred datasets, which can be curated and stored on a modest budget.

Talking about archiving on a PDB scale might be fantastic in principle, but it would require time and resources on a scale that would not clearly stand a
cost-benefit trial, especially in times of austerity.

In contrast, a systematic effort by our community to deposit DNA in existing databanks like AddGene.com, and to annotate PDB entries with such deposition
numbers, would be cheap and efficient, and could have far-reaching implications for the many people who could then easily obtain the DNA and start studying
structures in the database. That would surely lead to new science, because people interested enough in these structures to claim the DNA and
'redo' the project would add new science. One can even imagine SG centers offering such a service ('please redo structure X for this and that reason')
for a fee representing the real costs, which should be low given the experience and technology already in place there - a subset
of targets could be handled on a 'request' basis...

Sorry for getting wild ... we can of course now have a referendum to decide on the best curse of action! :-(

A.

PS Rob, you are of course right about sequencing costs, but I was only trying to paint the bigger picture...



Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
Department of Biochemistry (B8)
Netherlands Cancer Institute, 
Dept. B8, 1066 CX Amsterdam, The Netherlands




----------
From: Anastassis Perrakis


To avoid misunderstandings, since I received a couple of emails already:

? was a typo. I meant Absolutely!
 I think such data are essential for development of better processing software, and I find the development of better
processing software of paramount importance!
Curse was not a typo.
I am Greek. Today, thinking of referendums, I see many curses of action, and limited courses of action.

A.

----------
From: George M. Sheldrick


Speaking as a part-time methods developer, I agree with Tassos that a couple
of hundred suitably chosen and documented datasets would be adequate for most
purposes. I find that it is always revealing to be able to compare a new
algorithm with existing attempts to solve the same problem, and this is much
easier if we use the same data when reporting such tests. Since I am most
interested in phasing, all I need are unmerged reflection datasets and a PDB
file of the final model. It would be a relatively small extension of the
current deposition requirements to ask depositors to provide unmerged
intensities and sigI for the data collected for phasing as well as for the
final refinement. This would also provide useful additional information for
validation (even where experimental phasing failed and the structure was
solved by MR).

George
--
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,

----------
From: Gerard Bricogne


Dear Tassos,

    If you apologise for a long e-mail in a long chain of them, I don't
know with what oratory precautions I should preface mine ... . I will
instead skip the disclaimers and try to remain brief.

    It seems to me that there is a slight paradox, or inconsistency, in
your position. You concur with the view, expressed by many and just now
supported by George, that developers could perfectly well do their job on
the basis of relatively small collections of test datasets that they could
assemble through their own connections or initiative. I mostly agree with
this. So the improvements will take place, perhaps not to the same final
degree of robustness, but to a useful degree nevertheless. What would be
lost, however, is the possibility of reanalysing all the other raw image
datasets, to get the benefit of those new developments for the data associated
with the large number of other pdb entries for which images would have been
deposited if a scheme such as the one proposed had been put in place, but
will not have been otherwise.

    Well and good, have said many, but who would do that anyway? And what
benefit would it bring? I understand the position of these sceptics, but I
do not see how you, dear Tassos, of all people, can be of this opinion, when
you have just in the previous sentence sung the praises of Gert and Robbie
and PDB-REDO, as well as expressed regret that this effort remains greatly
underappreciated. If at the time of discussing the deposition of structure
factor data people had used your argument (that it is enough for developers
to gather their own portfolio of test sets of such data from their friends
and collaborators) we could perhaps have witnessed comparable improvements
in refinement software, but there would have been no PDB-REDO because the
data for running it would simply not have been available! ;-) . Or do you
think the parallel does not apply?

    One of the most surprising aspects of this overall discussion has been
the low value that so many people seem to place on the possibility of
finding improved results for pdb entries one returns to after a while - as
would be the case with a bottle of good wine that had matured in the cellar.
OK, it can be a bit annoying to have to
accept that anyone could improve one's *own* results; but being able to find
better versions of a lot of other people's structures would have had, I would
have thought, some value. From the perspective of your message, then, why
are the benefits of PDB-REDO so unique that PDB-REPROCESS would have no
chance of measuring up to them?


    That is the best I managed to do to keep this reply brief :-) .


    With best wishes,

         Gerard.

--

    ===============================================================
    *                                                             *
    * Gerard Bricogne                     gb10@GlobalPhasing.com  *
    *                                                             *
    * Sheraton House, Castle Park      
    * Cambridge CB3 0AX, UK            
    *                                                             *
    ===============================================================

----------
From: Edward A. Berry

Gerard Bricogne wrote:

. . . . the view, expressed by many and just now

Well, let's put it to the test:
let one developer advertise on this board a request for the type
of datasets (s)he would like to have as a test case for a current project.
The assumption is that the data are out there. See whether or not people
recognize their data as fitting the request and voluntarily supply
them, or whether we need this effort to make all data available and (what would
be more burdensome) annotate them sufficiently that the same developer
looking for a particular pathology would be able to find them among
the petabytes of other data.

I seem to remember two or three times in the past 18 years when
such requests were made (and there is the standing request to make
data submitted to the ARP/wARP server available to the developers),
and I assumed the developers were getting what they wanted.
Maybe not - maybe they found that no one responds to those requests, so they
stopped making them.
Ed

----------
From: Anastassis Perrakis


Dear Gerard

Isolating your main points: ... I was thinking of the inconsistency while sending my previous email ... ;-)

Basically, the parallel does apply. PDB-REPROCESS in a few years would
be really fantastic - speaking as a crystallographer and methods developer.

Speaking as a structural biologist though, I did think long and hard about
the usefulness of PDB_REDO. I obviously decided it's useful, since I am now
heavily involved in it, for a few reasons: uniformity of final-model treatment,
improving refinement software, better statistics on structure-quality metrics,
and of course seeing whether the new models will change our understanding of
the biology of the system.

An experiment that I would like to do as a structural biologist is the following:
what about adding an "increasing noise" model to the Fobs of a few datasets and re-refining?
How much would that noise change the final model-quality metrics, and by how much in absolute terms?
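
A minimal sketch of what the noise-addition step of such an experiment could look like, using numpy on simulated amplitudes; reading and writing real reflection files, and the re-refinement itself, are left to the usual tools, and the noise multipliers k below are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated stand-ins for deposited data: amplitudes F and their sigmas.
    # In a real test these would be read from the deposited reflection file.
    n_refl = 50_000
    F = rng.gamma(shape=2.0, scale=100.0, size=n_refl)   # fake Fobs
    SIGF = 0.02 * F + 1.0                                # fake sigma(Fobs)

    def add_noise(F, SIGF, k, rng):
        """Return amplitudes with extra Gaussian noise of width k*sigma(F)."""
        F_noisy = F + rng.normal(0.0, k * SIGF)
        return np.clip(F_noisy, 0.0, None)               # keep amplitudes non-negative

    def r_factor(F1, F2):
        """Conventional R between two sets of amplitudes."""
        return np.sum(np.abs(F1 - F2)) / np.sum(F1)

    # 'Increasing noise' model: each k-fold noisier dataset would then be
    # re-refined and the model-quality metrics compared with the original.
    for k in (0.5, 1.0, 2.0, 4.0, 8.0):
        F_k = add_noise(F, SIGF, k, rng)
        print(f"k = {k:4.1f}   R(F, F_noisy) = {r_factor(F, F_k):.3f}")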

(for the changes that PDB_RE(BUILD) makes, have a preview at http://www.ncbi.nlm.nih.gov/pubmed/22034521
... I tried to avoid the shamelessly self-promoting plug, but could not resist in the end!)

That experiment - or a better-designed variant of it - might tell us whether we should be advocating the archiving of all images;
and once scientifically convinced of its importance beyond methods development, we would all argue a strong case
to the funding and hosting agencies.

Tassos

PS Of course, that does not negate the all-important argument that, when struggling with marginal
data, better processing software is essential. There is a clear need for better software
to process images, especially for low-resolution and low signal-to-noise cases.
Since that depends on having test data, I am all for supporting an initiative to collect such data,
and I would gladly spend a day digging through our archives to contribute.

----------
From: James Holton

On general scientific principles the reasons for archiving "raw data" all boil down to one thing: there was a systematic error, and you hope to one day account for it.  After all, a "systematic error" is just something you haven't modeled yet.  Is it worth modelling?  That depends...

There are two main kinds of systematic error in MX:
1) Fobs vs Fcalc
   Given that the reproducibility of Fobs is typically < 3%, but typical R/Rfree values are in the 20%s, it is safe to say that this is a rather whopping systematic error.  What causes it?  Dunno.  Would structural biologists benefit from being able to model it?  Oh yes!  Imagine being able to reliably see a ligand that has an occupancy of only 0.05, or to be able to unambiguously distinguish between two proposed reaction mechanisms and back up your claims with hard-core statistics (derived from SIGF).  Perhaps even teasing apart all the different minor conformers occupied by the molecule in its functional cycle?  I think this is the main reason why we all decided to archive Fobs: 20% error is a lot.

2) scale factors
   We throw a lot of things into "scale factors", including sample absorption, shutter timing errors, radiation damage, flicker in the incident beam, vibrating crystals, phosphor thickness, point-spread variations, and many other phenomena.  Do we understand the physics behind them?  Yes (mostly).  Is there "new biology" to be had by modelling them more accurately?  No.  Unless, of course, you count all the structures we have not solved yet.

Wouldn't it be nice if phasing from sulfur, phosphorus, chloride and other "native" elements actually worked?  You wouldn't have to grow SeMet protein anymore, and you could go after systems that don't express well in E. coli.  Perhaps even go to the native source!  I think there is plenty of "new biology" to be had there.  Wouldn't it be nice if you could do S-SAD even though your spots were all smeary and overlapped and mosaic and radiation-damaged?

 Why don't we do this now?  Simple: it doesn't work.  Why doesn't it work?  Because we don't know all the "scale factors" accurately enough.  In most cases, the "% error" from all the scale factors adds up to ~3% (aka Rmerge, Rpim, etc.), but the change in spot intensities due to native-element anomalous scattering is usually less than 1%.  Currently, the world record for the smallest Bijvoet ratio is ~0.5% (Wang et al. 2006), but if photon counting were the only source of error, we should be able to get Rmerge of ~0.1% or less, particularly in the low-angle resolution bins.  If we can do that, then there will be little need for SeMet anymore.
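
A back-of-the-envelope version of these numbers, using a Crick/Magdoff-style rule of thumb for the anomalous signal and pure Poisson counting statistics for the noise floor; the protein size, sulfur count, f'' value and photon count below are illustrative assumptions, not measurements:

    import math

    def bijvoet_ratio(n_anom, n_protein_atoms, f_double_prime, z_eff=6.7):
        """Rule-of-thumb anomalous signal <|dF|>/<F> (Crick/Magdoff-style estimate)."""
        return math.sqrt(2.0 * n_anom / n_protein_atoms) * f_double_prime / z_eff

    def poisson_rmerge(photons_per_obs):
        """Expected Rmerge if photon counting were the only source of error:
        sigma/I = 1/sqrt(N), and mean |I - <I>| ~ sqrt(2/pi) * sigma for Gaussian errors."""
        return math.sqrt(2.0 / math.pi) / math.sqrt(photons_per_obs)

    # Illustrative S-SAD case: ~2000 non-H protein atoms, 10 sulfurs,
    # f''(S) ~ 0.56 e at Cu K-alpha.
    print("Bijvoet ratio           ~ %.2f %%" % (100 * bijvoet_ratio(10, 2000, 0.56)))

    # A strong low-angle spot recording ~1e6 photons:
    print("counting-limited Rmerge ~ %.2f %%" % (100 * poisson_rmerge(1e6)))

    # Roughly 0.8 % signal against a ~0.1 % counting floor is workable; against
    # the ~3 % contributed by real-world scale factors it is not.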

But we need the "raw" images if we are to have any hope of figuring out how to get the errors down to the 0.1% level.  There is no one magic dataset that will tell us how to do this; we need to "average over" lots of them.  Yes, this is further "upstream" of the "new biology" than deposited Fs, and yes, the cost of archiving images is higher, but I think the potential benefits to the structural biology community, if we can crack the 0.1% S-SAD barrier, are nothing short of revolutionary.

-James Holton
MAD Scientist

----------
From: Graeme Winter

Hi Ed,

OK, I'll bite: I would be very interested to see any data sets which
were initially thought to be e.g. PG222 and scale OK-ish with that, but
turn out in hindsight to be, say, PG2. Being able to spot this
automatically, or at least to warn about it inside xia2, would be
really handy. Any pseudosymmetric examples would be interesting.

Also any which are pseudocentred - index OK in C2 (say) but should
really be P2 (with the same cell) as the "missing" reflections are in
fact present but are just rather weaker due to NCS.

I have one example of each from the JCSG but more would be great,
especially in cases where the structure was solved & deposited.

There we go.

Now the matter of actually getting these here is slightly harder but
if anyone has an example I will work something out. Please get in
touch off-list... I will respond to the BB in a week or so to feed
back on how responses to this go :o)

Best wishes,

Graeme

----------
From: Loes Kroon-Batenburg


The problem is that, in practice, errors in the data arise from systematic errors, as James Holton nicely sums up. It is hard to 'model' these. We really need many test cases to improve data-processing techniques. An interesting suggestion mentioned along the thread was to create a database of problematic data, i.e. data containing unexplained peaks, diffuse streaks, strange reflection profiles, etc.
Needless to say, developers (and future PDB-REPROCESS initiatives) will benefit from raw-data deposition. We will have to establish whether it is worth the money and effort to deposit ALL raw data.

Loes.

--
__________________________________________

Dr. Loes Kroon-Batenburg
Dept. of Crystal and Structural Chemistry
Bijvoet Center for Biomolecular Research
Utrecht University
Padualaan 8, 3584 CH Utrecht
The Netherlands
__________________________________________

----------
From: James Holton


I tried looking for such "evil symmetry problem" examples some time ago, only to find that primitive monoclinic with a 90-degree beta angle is much rarer than one might think from looking at the PDB: about 1/3 of such entries appear to be in the wrong space group.

Indeed, there are at least 366 PDB entries that claim "P2-ish", but POINTLESS thinks the space group of the deposited data is higher (PG222, C2, P6, etc.).  Now, POINTLESS can be fooled by twinned data, but at least 286 of these entries do not mention twinning.  Of these, 40 explicitly list NCS operators (not sure if the others used NCS?), and 35 of those were both solved by molecular replacement and explicitly say the free-R set was picked at random.  These are:

Now, I'm sure there is an explanation for each and every one of these.  But in the hands of a novice, such cases could easily result in a completely wrong structure giving a perfectly reasonable Rfree.  This would happen if you started with, say, a wrong MR solution, but picked your random Rfree set in PG2 and then applied "NCS".  Then each of your "free" hkls would actually be NCS-restrained to be the same as a member of the working set.  However, I'm sure everyone who reads the CCP4BB already knew that.  Perhaps because a discerning peer-reviewer, PDB annotator or some clever feature in our modern bullet-proof crystallographic software caught such a mistake for them. (Ahem)
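
One practical defence against the pitfall just described, when tight NCS or pseudo-symmetry is suspected, is to pick the free set so that all putative symmetry mates carry the same flag. A minimal sketch of that grouping idea, assuming the pseudo-symmetry is a 222 rotation group on the existing axes (so that, including Friedel mates, all equivalents of (h,k,l) share the key (|h|,|k|,|l|)); production tools handle general symmetry, twin laws and thin resolution shells, so this only shows the principle:

    import random

    def free_flags_grouped(hkls, free_fraction=0.05, seed=42):
        """Assign free-R flags so that reflections related by a (pseudo-)222
        point group plus Friedel symmetry always land in the same set.
        For Laue group mmm the orbit of (h,k,l) is captured by (|h|,|k|,|l|)."""
        rng = random.Random(seed)
        flags = {}
        group_flag = {}                    # one random decision per orbit
        for hkl in hkls:
            key = tuple(abs(i) for i in hkl)
            if key not in group_flag:
                group_flag[key] = rng.random() < free_fraction
            flags[hkl] = group_flag[key]   # True = free, False = work
        return flags

    # The mate (-1, -2, 3) of (1, 2, 3) always receives the same flag, so
    # NCS or pseudo-symmetry cannot leak working-set information into Rfree.
    print(free_flags_grouped([(1, 2, 3), (-1, -2, 3), (1, -2, -3), (2, 0, 5)]))

The same idea extends to any other suspected point group or twin law by swapping in the appropriate key function.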

Of course, what Graeme is asking for is the opposite of this: data that would appear as "nearly" PG222, but was actually lower symmetry.  Unfortunately, there is no way to identify such cases from deposited Fs alone, as they will have been overmerged.  In fact, I did once see a talk where someone managed to hammer an NCS 7-fold into a crystallographic 2-fold by doing some aggressive "outlier rejection" in scaling.  Can't remember if that ever got published...

-James Holton
MAD Scientist

----------
From: Clemens Vonrhein

Hi James,

scary ... I was just looking at exactly the same thing (P21 with
beta~90), using the same tool (POINTLESS).

Currently I'm going through the structures for which images can be
found ... I haven't got far through that list yet (in fact, only the
first one so far), but this first case should indeed be in a higher
space group (P 2 21 21).

As you say (and that's what Graeme looks for): finding 'over-merged'
datasets can be a bit more tricky ... once the damage is done. I have
the hunch that it might happen even more often though: we tend to
look for the highest symmetry that still gives a good indexing score,
right?  Otherwise we would all go for P1 ...

Some other interesting groups for under-merging:

 * orthorhombic with a==b or a==c or b==c (maybe tetragonal?)

 * trigonal (P 3 etc) when it should be P 6

 * monoclinic with beta==120

There are a few cases for each of those too ... all easy to check in
ftp://ftp.wwpdb.org/pub/pdb/derived_data/index/crystal.idx and then,
if structure factors are deposited, by running POINTLESS on them (great
program, Phil!).
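
A minimal sketch of that kind of metric-based screen over (crystal system, unit cell) records, however one extracts them (the exact column layout of crystal.idx should be checked before parsing it, so file handling is omitted); the tolerances are arbitrary illustrative choices, and the deposited intensities - via POINTLESS - remain the real test:

    def suspicious_cell(system, a, b, c, alpha, beta, gamma,
                        angle_tol=0.3, length_tol=0.002):
        """Flag unit cells whose metric looks 'nearly' higher-symmetry than
        the reported crystal system: candidates for mis-assigned symmetry."""
        same = lambda x, y: abs(x - y) <= length_tol * max(x, y)
        flags = []
        if system == "monoclinic" and abs(beta - 90.0) <= angle_tol:
            flags.append("beta ~ 90: test orthorhombic or higher with POINTLESS")
        if system == "monoclinic" and abs(beta - 120.0) <= angle_tol:
            flags.append("beta ~ 120: test trigonal/hexagonal settings")
        if system == "orthorhombic" and (same(a, b) or same(b, c) or same(a, c)):
            flags.append("two nearly equal edges: test tetragonal")
        if system == "trigonal":
            flags.append("P3-like group: test merging in P6x")
        if system == "triclinic" and all(abs(x - 90.0) <= angle_tol for x in (alpha, beta, gamma)):
            flags.append("all angles ~ 90: test monoclinic/orthorhombic")
        return flags

    # Hypothetical monoclinic entry with beta very close to 90 degrees:
    print(suspicious_cell("monoclinic", 45.1, 67.9, 52.3, 90.0, 90.1, 90.0))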

Cheers

Clemens

----------
From: Bryan Lepore


not sure I follow this thread, but this table might be interesting :

http://journals.iucr.org/d/issues/2010/05/00/dz5193/dz5193sup1.pdf

from:

Detection and correction of underassigned rotational symmetry prior to
structure deposition
B. K. Poon, R. W. Grosse-Kunstleve, P. H. Zwart and N. K. Sauter
Acta Cryst. (2010). D66, 503-513    [ doi:10.1107/S0907444910001502 ]

----------
From: Clemens Vonrhein

Oh yes, that is relevant and very interesting. As far as I understand
it, the detection of higher symmetry is based on the atomic
coordinates and not structure factors though (please correct me if I'm
wrong here).

At least some of the cases for which the deposited structure factors
strongly suggest a higher symmetry don't seem to be detected using
that paper's approach (I can't find them listed in the supplemental).

Cheers

Clemens

----------
From: Felix Frolow


 God bless symmetry: we are saved from over-interpreting symmetry (except probably in very exotic cases) by the very high Rsym factors, around 40-50%, when the symmetry is wrong.
Even wild rejection of outliers cannot produce an "acceptable" Rmerge.
In my personal repository, 1QZV is a manifestation of that. At 4.4 angstrom resolution, a wrong interpretation of a 90.2-degree monoclinic angle as 90 degrees orthorhombic, supported by two molecules in the monoclinic asymmetric unit, was corrected in the middle of the first data collection. Habitual on-the-fly processing of the data (integration and repetitive scaling after every several frames with HKL) detected that, about halfway through the data, the R factor in the orthorhombic space group jumped from about 7% to 40%.
Reindexing solved the problem on the spot. I still keep the raw data.
Needless to say, a decade or so ago we would have taken precession photographs (I still own a precession camera) and would not
have made such a mistake.
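
A minimal sketch of that kind of on-the-fly check, assuming unmerged observations are available as (hkl, batch, intensity) records from the integration program and already mapped to unique indices in the assumed symmetry; the batch step and jump threshold are arbitrary illustrative choices:

    from collections import defaultdict

    def rmerge(observations):
        """Rmerge = sum_hkl sum_i |I_i - <I>_hkl| / sum_hkl sum_i I_i."""
        groups = defaultdict(list)
        for hkl, _batch, intensity in observations:
            groups[hkl].append(intensity)
        num = den = 0.0
        for intensities in groups.values():
            mean = sum(intensities) / len(intensities)
            num += sum(abs(i - mean) for i in intensities)
            den += sum(intensities)
        return num / den if den else 0.0

    def monitor(observations, batch_step=10, jump=3.0):
        """Recompute Rmerge as batches accumulate and warn on a sudden jump,
        e.g. when a pseudo-symmetric lattice stops merging in the assumed group."""
        previous = None
        max_batch = max(batch for _, batch, _ in observations)
        for cutoff in range(batch_step, max_batch + batch_step, batch_step):
            subset = [obs for obs in observations if obs[1] <= cutoff]
            r = rmerge(subset)
            if previous and r > jump * previous:
                print(f"batch {cutoff}: Rmerge jumped to {r:.1%} - recheck the symmetry!")
            previous = r

    # Usage: monitor([((1, 2, 3), 1, 1523.0), ((1, 2, 3), 12, 190.0), ...])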
Dr Felix Frolow   
Professor of Structural Biology and Biotechnology
Department of Molecular Microbiology
and Biotechnology
Tel Aviv University 69978, Israel

Acta Crystallographica F, co-editor




----------
From: Felix Frolow


Clemens,
In the past we used TRACER (public domain) to check for higher symmetry, or we interpreted Niggli values manually :-)
TRACER is long gone and Niggli values are not displayed anymore, so we trust the auto-indexing of DENZO, which, assuming all experimental parameters are properly set (we do this by using a standard crystal such as lysozyme), is extremely sensitive in defining the Bravais system. I have no experience with POINTLESS, but assume that it also does an excellent job.


FF

----------
From: <mjvdwoerd

Reluctantly I am going to add my 2 cents to the discussion, with various aspects in one e-mail.

- It is easy to overlook that our "business" is to answer biological/biochemical questions. This is what you (generally) get grants to do (by showing that these questions are of critical importance and that you are able to do the science). Crystallography is one tool that we use to acquire evidence to answer questions. The time when you could get a Nobel prize, or a PhD, for doing a structure is gone. Even a publication with just a structure is not as common as it used to be. So the "biochemistry" drives the crystallography. It is not reasonable to say that once you have collected data and have not published it for 5 years, you are no longer interested. What that generally means is that "the rest of science" is not cooperating. In short: I would be against a strict rule for mandatory deposition of raw data, even after a long time. An example: I have data sets here with low-resolution data (~10 Å), presumably of proteins whose structures are known for prokaryotes but not for eukaryotes, and it would be exciting if we could prove (or disprove) that they look the same. The problem, apart from the resolution, is that the spots are so few and fuzzy that I cannot index the images. The main reason I save the images is that if/when someone comes to me saying they think they have made better crystals, we have something to compare against. (Thanks to Gerard B. for the encouragement to write this item :-)

- For those who think we have come to the end of development in crystallography, James Holton (thank you) has described nicely why we should not think this. We are all happy if our model gives an R-factor of 20%. Small-molecule crystallographers would wave that away in an instant as inadequate, yet "everybody" has come to accept that this is fine for protein crystallography. It would be better if our models were more consistent with the experimental data. How could we build such models without access to lots of data? As a student I was always taught (when asking why 20% is actually "good") that we don't (for example) model solvent. Why not? It is not easy. If we did, would the 20% go down to 3%? I am guessing not; there are other errors that come into play.

- Gerard K. has eloquently spoken about cost and effort. Since I maintain a small (local) archive of images, I can affirm his words: a large-capacity disk is inexpensive ($100). A box for the disk to sit in is inexpensive ($1000). A second box that holds the backup and sits in a different building (for security reasons) is inexpensive ($1400, with 4 disks). The infrastructure to run these boxes (power, fiber optics, boxes in between) is slightly more expensive. What is *really* expensive is the people maintaining everything. It was a huge surprise to me (and my boss) how much time and effort it takes to annotate all data sets, rename them appropriately and file them away in a logical place so that anyone (who understands the scheme) can find them again. Therefore (!) the reason why this should be centralized is that the cost per data set stored goes down - it is more efficient. One person can process several (many, if largely automated) data sets per day. It is also of interest that we locally (2-5 people on a project) may not agree on what exactly should be stored. Therefore there is no hope that we can find consensus in the world, but we CAN reach a reasonable compromise. But it is tough: I have heard the argument that data for published structures should be kept in case someone wants to look at them and/or go back, and I have also heard the argument that once published the work is signed, sealed and delivered and the data can go, while UNpublished data should be preserved because they will hopefully get to publication eventually. Each argument is reasonably sensible, but the conclusions are opposite. (I maintain both classes of data sets.)

- Granting agencies in the US generally require that you archive scientific data. What is not yet clear is whether they would be willing to pay for a centralized facility that would do that. After all, it is more exciting for NIH to give money for the study of a disease than for storing data. But if the argument were made that each grant(ee) would be more efficient and could apply more money towards the actual problem, this might convince them. For that we would need a reasonable consensus on what we want and why. More power to John H. and "The Committee".

Thanks to complete "silence" on the BB today I am finally caught up reading!

Mark van der Woerd
 



----------
From: Deacon, Ashley M.


All,



We have been following the CCP4BB discussion with interest. As has been mentioned on several occasions,
the JCSG has maintained, for several years now, an open archive of all diffraction datasets associated with
our deposited structures. Overall this has been a highly positive experience and many developers, researchers,
teachers and students have benefited from our archive. We currently have close to 100 registered users of our
archive, and we seem to receive a new batch of users each time our archive is acknowledged in a paper or is
mentioned at a conference. Building on this initial success, we are currently extending our archive to include
unsolved datasets, which will help us more readily share data and collaborate with methods developers on some
of our less tractable datasets. We are also planning to include screening images for all crystals evaluated as part
of the JCSG pipeline (largely as a feedback tool to help improve crystal quality).



At JCSG, we benefit tremendously from our central database, which already tracks all required metadata associated
with any crystal. Thus I agree with other comments that the cost of such an undertaking should not be underestimated.
The cost of the hardware may be modest; however, people and resources are needed to develop and maintain a robust
and reliable archive.



To date we have not assigned DOIs to our datasets, but we certainly feel this would be of value going forward and are
currently considering this option for our revised archive, which is currently in development.



If successful then this may form a good prototype system, which could be opened up to a broader community outside
of JCSG.



We (JCSG) have already shared much of our experience with the IUCr working group and we would be happy to
participate in and contribute to any ongoing efforts.



Sincerely,
Ashley Deacon

JCSG Structure Determination Core Leader