From: James Holton
Date: 7 November 2011 17:30
At the risk of sounding like another "poll", I have a pragmatic question for the methods development community:
Hypothetically, assume that there was a website where you could download the original diffraction images corresponding to any given PDB file, including "early" datasets that were from the same project, but because of smeary spots or whatever, couldn't be solved. There might even be datasets with "unknown" PDB IDs because that particular project never did work out, or because the relevant protein sequence has been lost. Remember, few of these datasets will be less than 5 years old if we try to allow enough time for the original data collector to either solve it or graduate (and then cease to care). Even for the "final" dataset, there will be a delay, since the half-life between data collection and coordinate deposition in the PDB is still ~20 months. Plenty of time to forget. So, although the images were archived (probably named "test" and in a directory called "john") it may be that the only way to figure out which PDB ID is the "right answer" is by processing them and comparing to all deposited Fs. Assume this was done. But there will always be some datasets that don't match any PDB. Are those interesting? What about ones that can't be processed? What about ones that can't even be indexed? There may be a lot of those! (hypothetically, of course).
Anyway, assume that someone did go through all the trouble to make these datasets "available" for download, just in case they are interesting, and annotated them as much as possible. There will be about 20 datasets for any given PDB ID.
Now assume that for each of these datasets this hypothetical website has two links, one for the "raw data", which will average ~2 GB per wedge (after gzip compression, taking at least ~45 min to download), and a second link for a "lossy compressed" version, which is only ~100 MB/wedge (2 min download). When decompressed, the images will visually look pretty much like the originals, and generally give you very similar Rmerge, Rcryst, Rfree, I/sigma, anomalous differences, and all other statistics when processed with contemporary software. Perhaps a bit worse. Essentially, lossy compression is equivalent to adding noise to the images.
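As a rough check on the figures just quoted, the following minimal sketch works out the transfer rate they imply and the ratio between the two downloads. It assumes 1 GB = 1000 MB; the variable and function names are illustrative only, and the sizes and times are the approximate ones given in the message.

```python
# Sizes and times as quoted in the message (approximate values).
RAW_MB = 2000.0     # ~2 GB per wedge after gzip (assuming 1 GB = 1000 MB)
LOSSY_MB = 100.0    # ~100 MB per wedge, lossy-compressed
RAW_MIN = 45.0      # quoted download time for the raw wedge
LOSSY_MIN = 2.0     # quoted download time for the lossy wedge


def implied_rate_mb_per_s(size_mb, minutes):
    """Average transfer rate implied by a size and a download time."""
    return size_mb / (minutes * 60.0)


raw_rate = implied_rate_mb_per_s(RAW_MB, RAW_MIN)
lossy_rate = implied_rate_mb_per_s(LOSSY_MB, LOSSY_MIN)
size_ratio = RAW_MB / LOSSY_MB
time_ratio = RAW_MIN / LOSSY_MIN

print(f"implied rate, raw download:   {raw_rate:.2f} MB/s")
print(f"implied rate, lossy download: {lossy_rate:.2f} MB/s")
print(f"size ratio raw/lossy: {size_ratio:.1f}")
print(f"time ratio raw/lossy: {time_ratio:.1f}")
```

Both quoted times correspond to a link of somewhat under 1 MB/s, so the difference in waiting time comes almost entirely from the difference in file size.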
Which one would you try first? Does lossy compression make it easier to hunt for "interesting" datasets? Or is it just too repugnant to have "modified" the data in any way shape or form ... after the detector manufacturer's software has "corrected" it? Would it suffice to simply supply a couple of "example" images for download instead?
-James Holton
MAD Scientist
----------
From: Herbert J. Bernstein
This is a very good question. I would suggest that both versions
of the old data are useful. If what is being done is simple validation
and regeneration of what was done before, then the lossy compression
should be fine in most instances. However, when what is being
done hinges on the really fine details -- looking for lost faint
spots just peeking out from the background, looking at detailed
peak profiles -- then the lossless compression version is the
better choice. The annotation for both sets should be the same.
The difference is in storage and network bandwidth.
Hopefully the fraud issue will never again rear its ugly head,
but if it should, then having saved the losslessly compressed
images might prove to have been a good idea.
To facilitate experimentation with the idea, if there is agreement
on the particular lossy compression to be used, I would be happy
to add it as an option in CBFlib. Right now all the compressions
we have are lossless.
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
=====================================================
----------
From: James Holton
So far, all I really have is a "proof of concept" compression algorithm here:
http://bl831.als.lbl.gov/~jamesh/lossy_compression/
Not exactly "portable" since you need ffmpeg and the x264 libraries
set up properly. The latter seems to be constantly changing things
and breaking the former, so I'm not sure how "future proof" my
"algorithm" is.
Something that caught my eye recently was fractal compression,
particularly since FIASCO has been part of the NetPBM package for
about 10 years now. Seems to give comparable compression vs quality
as x264 (to my eye), but I'm presently wondering if I'd be wasting my
time developing this further? Will the crystallographic world simply
turn up its collective nose at lossy images? Even if it means waiting
6 years for "Nielsen's Law" to make up the difference in network
bandwidth?
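For readers who want to play with the waiting-time estimate, here is a small sketch. It assumes the commonly quoted form of Nielsen's Law (high-end connection speed growing by about 50% per year) and takes the size ratio from the ~2 GB and ~100 MB figures in the opening message; the function name and the growth parameter are illustrative assumptions, not anything from the thread.

```python
import math


def years_to_gain(factor, annual_growth=0.5):
    """Years of compound bandwidth growth needed to gain `factor` in speed.

    annual_growth=0.5 is the commonly quoted Nielsen's Law figure (an assumption here).
    """
    return math.log(factor) / math.log(1.0 + annual_growth)


def gain_after(years, annual_growth=0.5):
    """Bandwidth multiplier accumulated after `years` of compound growth."""
    return (1.0 + annual_growth) ** years


size_ratio = 2000.0 / 100.0  # ~2 GB raw vs ~100 MB lossy per wedge

print(f"years to gain a factor of {size_ratio:.0f}: {years_to_gain(size_ratio):.1f}")
print(f"bandwidth gain after 6 years: {gain_after(6):.1f}x")
```

The result is sensitive to the assumed growth rate and to which compression ratio is used, so the exact number of years should be read as an order-of-magnitude estimate.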
-James Holton
MAD Scientist
----------
From: Herbert J. Bernstein
Dear James,
You are _not_ wasting your time. Even if the lossy compression ends
up only being used to stage preliminary images forward on the net while
full images slowly work their way forward, having such a compression
that preserves the crystallography in the image will be an important
contribution to efficient workflows. Personally I suspect that
such images will have more important uses, e.g. facilitating
real-time monitoring of experiments using detectors providing
full images at data rates that simply cannot be handled without
major compression. We are already in that world. The reason that
the Dectris images use Andy Hammersley's byte-offset compression,
rather than going uncompressed or using CCP4 compression is that
in January 2007 we were sitting right on the edge of a nasty CPU-performance/disk bandwidth tradeoff, and the byte-offset
compression won the competition. In that round a lossless
compression was sufficient, but just barely. In the future,
I am certain some amount of lossy compression will be
needed to sample the dataflow while the losslessly compressed
images work their way through a very back-logged queue to the disk.
In the longer term, I can see people working with lossy compressed
images for analysis of massive volumes of images to select the
1% to 10% that will be useful in a final analysis, and may need
to be used in a lossless mode. If you can reject 90% of the images
with a fraction of the effort needed to work with the resulting
10% of good images, you have made a good decision.
And then there is the inevitable need to work with images on
portable devices with limited storage over cell and WIFI networks. ...
I would not worry about upturned noses. I would worry about
the engineering needed to manage experiments. Lossy compression
can be an important part of that engineering.
Regards,
Herbert
--
Dowling College, Brookhaven Campus, B111B
1300 William Floyd Parkway, Shirley, NY, 11967
=====================================================
----------
From: Frank von Delft
I'll second that... can't remember anybody on the barricades about "corrected" CCD images, but they've been just so much more practical.
Different kind of problem, I know, but equivalent situation: the people to ask are not the purists, but the ones struggling with the huge volumes of data. I'll take the lossy version any day if it speeds up real-time evaluation of data quality, helps me browse my datasets, and allows me to do remote but intelligent data collection.
phx.
----------
From: Miguel Ortiz Lombardia
So the purists of speed seem to be more relevant than the purists of images.
We complain all the time about how many errors we have out there in our
experiments that we seemingly cannot account for. Yet, would we add
another source?
Sorry if I'm missing something serious here, but I cannot understand
this artificial debate. You can do useful remote data collection without
having to look at *each* image.
Miguel
----------
From: Jan Dohnalek
I think that real universal image depositions will not take off without a newish type of compression that will speed up and ease up things.
Therefore the compression discussion is highly relevant - I would even suggest to go to mathematicians and software engineers to provide
a highly efficient compression format for our type of data - our data sets have some very typical repetitive features, so they can very likely be compressed as a whole set without losing information (differential compression in the series), but this needs experts.
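The idea of differential compression across a series can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the frames are synthetic, with an identical background in every frame plus a few changing "spots", and zlib stands in for whatever compressor experts might actually choose. Real diffraction frames carry independent counting noise in every pixel, so the gain would likely be much smaller than in this idealized case.

```python
import random
import zlib
from array import array

WIDTH, HEIGHT, N_FRAMES = 256, 256, 6  # toy dimensions, not a real detector


def make_frames(seed=0):
    """Synthetic series: one fixed background plus a few spots that differ per frame."""
    rng = random.Random(seed)
    n = WIDTH * HEIGHT
    background = [rng.randint(0, 50) for _ in range(n)]
    frames = []
    for _ in range(N_FRAMES):
        frame = list(background)
        for _ in range(200):
            frame[rng.randrange(n)] += rng.randint(100, 5000)
        frames.append(frame)
    return frames


def pack(values):
    """Serialize a list of ints as signed native integers."""
    return array("i", values).tobytes()


def delta_encode(frames):
    """Keep the first frame, then store each frame as a difference from its predecessor."""
    out = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for c, p in zip(cur, prev)])
    return out


def delta_decode(deltas):
    """Invert delta_encode exactly (lossless)."""
    frames = [list(deltas[0])]
    for d in deltas[1:]:
        frames.append([p + x for p, x in zip(frames[-1], d)])
    return frames


frames = make_frames()
deltas = delta_encode(frames)
assert delta_decode(deltas) == frames  # round trip loses nothing

independent = sum(len(zlib.compress(pack(f))) for f in frames)
differential = sum(len(zlib.compress(pack(d))) for d in deltas)
print(f"frames compressed independently: {independent} bytes")
print(f"frame-to-frame differences:      {differential} bytes")
```

The point of the sketch is only that the differencing step is exactly reversible; how much it helps on real data is an empirical question.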
Jan Dohnalek
--
Jan Dohnalek, Ph.D
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
16206 Praha 6
Czech Republic
----------
From: Graeme Winter
Hi James,
Regarding the suggestion of lossy compression, it is really hard to
comment without having a good idea of the real cost of doing this. So,
I have a suggestion:
- grab a bag of JCSG data sets, which we know should all be essentially OK.
- you squash then unsquash them with your macguffin, perhaps
randomizing them as to whether A or B is squashed.
- process them with Elves / xia2 / autoPROC (something which is reproducible)
- pop the results into pdb_redo
Then compare the what-comes-out. Ultimately adding "noise" may (or may
not) make a measurable difference to the final refinement - this may
be a way of telling if it does or doesn't. Why however would I have
any reason to worry? Because the noise being added is not really
random - it will be compression artifacts. This could have a subtle
effect on how the errors are estimated and so on. However you can hum
and haw about this for a decade without reaching a conclusion.
Here, it's something which in all honesty we can actually evaluate, so
is it worth giving it a go? If the results were / are persuasive (i.e.
a "report on the use of lossy compression in transmission and storage
of X-ray diffraction data" was actually read and endorsed by the
community) this would make it much more worthwhile for consideration
for inclusion in e.g. cbflib.
I would however always encourage (if possible) that the original raw
data is kept somewhere on disk in an unmodified form - I am not a fan
of one-way computational processes with unique data.
Thoughts anyone?
Cheerio,
Graeme
----------
From: Kay Diederichs
Hi James,
I see no real need for lossy compression datasets. They may be useful for demonstration purposes, and to follow synchrotron data collection remotely. But for processing I need the real data. It is my experience that structure solution, at least in the difficult cases, depends on squeezing out every bit of scattering information from the data, as much as is possible with the given software. Using a lossy-compression dataset in this situation would give me the feeling "if structure solution does not work out, I'll have to re-do everything with the original data" - and that would be double work. Better not start going down that route.
The CBF byte compression puts even a 20bit detector pixel into a single byte, on average. These frames can be further compressed, in the case of Pilatus fine-slicing frames, using bzip2, almost down to the level of entropy in the data (since there are so many zero pixels). And that would be lossless.
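To make the byte-offset idea concrete, here is a simplified sketch: each pixel is stored as its difference from the previous pixel, in a single byte when the difference fits, with escape codes for larger jumps, and the result is then passed through bzip2. This is an illustration written for this thread, not CBFlib's implementation; the synthetic frame (mostly empty pixels with a few strong spots), the pixel counts and the function names are all assumptions.

```python
import bz2
import random
import struct


def byte_offset_encode(pixels):
    """Store each pixel as a delta from the previous one, using escapes for large deltas."""
    out = bytearray()
    prev = 0
    for value in pixels:
        delta = value - prev
        if -127 <= delta <= 127:
            out += struct.pack("<b", delta)                     # one byte
        elif -32767 <= delta <= 32767:
            out += b"\x80" + struct.pack("<h", delta)           # escape + 16-bit delta
        else:
            out += b"\x80\x00\x80" + struct.pack("<i", delta)   # two escapes + 32-bit delta
        prev = value
    return bytes(out)


def byte_offset_decode(data, count):
    """Invert byte_offset_encode for `count` pixels."""
    pixels, pos, prev = [], 0, 0
    while len(pixels) < count:
        (delta,) = struct.unpack_from("<b", data, pos)
        pos += 1
        if delta == -128:                       # 0x80: a 16-bit delta follows
            (delta,) = struct.unpack_from("<h", data, pos)
            pos += 2
            if delta == -32768:                 # 0x8000: a 32-bit delta follows
                (delta,) = struct.unpack_from("<i", data, pos)
                pos += 4
        prev += delta
        pixels.append(prev)
    return pixels


def synthetic_frame(n_pixels=100_000, n_spots=50, seed=1):
    """Toy fine-sliced frame: mostly zero counts, a few 20-bit spots."""
    rng = random.Random(seed)
    pixels = [rng.choice((1, 2, 3)) if rng.random() < 0.1 else 0
              for _ in range(n_pixels)]
    for _ in range(n_spots):
        pixels[rng.randrange(n_pixels)] = rng.randint(1_000, 2**20 - 1)
    return pixels


pixels = synthetic_frame()
encoded = byte_offset_encode(pixels)
squeezed = bz2.compress(encoded)
assert byte_offset_decode(encoded, len(pixels)) == pixels  # lossless round trip

print(f"bytes per pixel after byte-offset: {len(encoded) / len(pixels):.3f}")
print(f"bytes per pixel after bzip2:       {len(squeezed) / len(pixels):.3f}")
```

On this toy frame the first stage lands close to one byte per pixel and bzip2 reduces it further, which is the behaviour described above; the actual numbers for real Pilatus frames will differ.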
Storing lossily-compressed datasets would of course not double the diskspace needed, but would significantly raise the administrative burdens.
Just to point out my standpoint in this whole discussion about storage of raw data:
I've been storing our synchrotron datasets on disks, since 1999. The amount of money we spend per year for this purpose is constant (less than 1000€). This is possible because the price of a GB disk space drops faster than the amount of data per synchrotron trip rises. So if the current storage is full (about every 3 years), we set up a bigger RAID (plus a backup RAID); the old data, after copying over, always consumes only a fraction of the space on the new RAID.
So I think the storage cost is actually not the real issue - rather, the real issue has a strong psychological component. People a) may not realize that the software they use is constantly being improved, and that needs data which cover all the corner cases; b) often do not wish to give away something because they feel it might help their competitors, or expose their faults.
best,
Kay (XDS co-developer)
----------
From: Harry Powell
Hi
I agree.
Harry
--
Dr Harry Powell, MRC Laboratory of Molecular Biology, MRC Centre, Hills Road, Cambridge, CB2 0QH
http://www.iucr.org/resources/commissions/crystallographic-computing/schools/mieres2011
----------
From: Miguel Ortiz Lombardía
On 08/11/11 10:15, Kay Diederichs wrote:
Hi Kay and others,
I completely agree with you.
Datalove, <3
:-)
----------
From: Herbert J. Bernstein
Um, but isn't Crystallography based on a series of
one-way computational processes:
photons -> images
images -> {structure factors, symmetry}
{structure factors, symmetry, chemistry} -> solution
{structure factors, symmetry, chemistry, solution}
-> refined solution
At each stage we tolerate a certain amount of noise
in "going backwards". Certainly it is desirable to
have the "original data" to be able to go forwards,
but until the arrival of pixel array detectors, we
were very far from having the true original data,
and even pixel array detectors don't capture every
single photon.
I am not recommending lossy compressed images as
a perfect replacement for lossless compressed images,
any more than I would recommend structure factors
are a replacement for images. It would be nice
if we all had large budgets, huge storage capacity
and high network speeds and if somebody would repeal
the speed of light and other physical constraints, so that
engineering compromises were never necessary, but as
James has noted, accepting such engineering compromises
has been of great value to our colleagues who work
with the massive image streams of the entertainment
industry. Without lossy compression, we would not
have the _higher_ image quality we now enjoy in the
less-than-perfectly-faithful HDTV world that has replaced
the highly faithful, but lower capacity, NTSC/PAL world.
Please, in this, let us not allow the perfect to be
the enemy of the good. James is proposing something
good.
Professor of Mathematics and Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
=====================================================
----------
From: Graeme Winter
Dear Herbert,
Sorry, the point I was getting at was that the process is one way, but
if it is also *destructive* i.e. the original "master" is not
available then I would not be happy. If the master copy of what was
actually recorded is available from a tape someplace perhaps not all
that quickly then to my mind that's fine.
When we go from images to intensities, the images still exist. And by
and large the intensities are useful enough that you don't go back to
the images again. This is worth investigating I believe, which is why
I made that proposal.
Mostly I listen to mp3's as they're convenient, but I still buy CD's
not direct off e.g. itunes, and yes a H264 compressed video stream is
much nicer to watch than VHS.
Best wishes,
Graeme
----------
From: James Holton
At the risk of putting this thread back on-topic, my original question was not "should I just lossfully compress my images and throw away the originals". My question was:
"would you download the compressed images first?"
So far, no one has really answered it.
I think it is obvious that of course we would RATHER have the original data, but if access to the original data is "slow" (by a factor of 30 at best) then can the "mp3 version" of diffraction data play a useful role in YOUR work?
Taking Graeme's request from a different thread as an example, he would like to see stuff in P21 with a 90 degree beta angle. There are currently ~609 examples of this in the PDB. So, I ask again: "which one would you download first?". 1aip? (It is first alphabetically). Then again, if you just email the corresponding authors of all 609 papers, the response rate alone might whittle the number of datasets to deal with down to less than 10. Perhaps even less than 1.
-James Holton
MAD Scientist
----------
From: Graeme Winter
Hi James,
Fair enough.
However I would still be quite interested to see how different the
results are from the originals and the compressed versions. If the
differences were pretty minor (i.e. not really noticeable) then I
would certainly have a good look at the mp3 version.
Also it would make my data storage situation a little easier, at least
what I use for routine testing. Worth a go?
Cheerio,
Graeme
----------
From: Miguel Ortiz Lombardia
On 08/11/2011 19:19, James Holton wrote:
Hmm, I thought I had been clear. I will try to be more direct:
Given the option, I would *only* download the original,
non-lossy-compressed data. At the expense of time, yes. I don't think
Graeme's example is very representative of our work, sorry.
As long as the option between the two is warranted, I don't care. I just
don't see the point for the very same reasons Kay has very clearly exposed.
Best regards,
----------
From: <mjvdwoerd
Hmmm, so you would, when collecting large data images, say 4 images, 100MB in size, per second, in the middle of the night, from home, reject seeing compressed images on your data collection software, while the "real thing" is lingering behind somewhere, to be downloaded and stored later? As opposed to not seeing the images (because your home internet access cannot keep up) and only inspecting 1 in a 100 images to see progress?
I think there are instances where compressed (lossy or not) images will be invaluable. I know the above situation was not the context, but (y'all may gasp about this) I still have some friends (in the US) who live so far out in the wilderness that only dial-up internet is available. Meanwhile, synchrotrons and the detectors used get better all the time, which means more MB/s produced.
James has already said (and I agree) that the original images (with all information) should not necessarily be thrown away. Perhaps a better question would be "which would you use for what purpose", since I am convinced that compressed images are useful.
I would want to process the "real thing", unless I have been shown by scientific evidence that the compressed thing works equally well. It seems reasonable to assume that such evidence can be acquired and/or that we can be shown by evidence what we gain and lose by lossy-compressed images. Key might be to be able to choose the best thing for your particular application/case/location etc.
So yes, James, of course this is useful and not a waste of time.
Mark
----------
From: Miguel Ortiz Lombardia
On 08/11/2011 20:46, mjvdwoerd@netscape.net wrote:
1. I don't need to *see* all images to verify whether the collection is
going all right. If I collect remotely, I process remotely, no need to
transfer images. Data is collected so fast today that you may, even
while collecting at the synchrotron, finish the collection without a)
seeing actually all the images (cf. Pilatus detectors) b) keeping in
pace at all your data processing. The crystal died or was not collected
properly? You try to understand why, you recollect it if possible or you
try a new crystal. It has always been like this; it's called trial and error.
2. The ESRF in Grenoble produces thumbnails of the images. If all you
want to see is whether there is diffraction, they are good enough and
they are useful. They are extremely lossy and useless for anything else.
3. Please, compare contemporary facts. Today's bandwidth is what it is,
today's images are *not* 100 Mb (yet). When they get there, let us know
what is the bandwidth.
I would understand a situation like the one you describe for a poor, or
an embargoed country where unfortunately there is no other way to
connect to a synchrotron. Still, that should be solved by the community
in a different way: by gracious cooperation with our colleagues in those
countries. Your example is actually quite upsetting, given the current
state of affairs in the world.
I think I was clear: as long as we have access to the original data, I
don't care. I would only use the original data.
This still assumes that future software will not be able to detect the
differences that you cannot see today. This may or may not be true, the
consequences may or may not be important. But there is, I think,
reasonable doubt on both questions.
I have said to James, off the list, that he should go on if he's
convinced about the usefulness of his approach. For a very scientific
reason: I could be wrong. Yet, if need be to go into the compression
path, I think we should prefer lossless options.
----------
From: Phil Evans
It would be a good start to get all images written now with lossless compression, instead of the uncompressed images we still get from the ADSC detectors. Something that we've been promised for many years.
Phil
----------
From: Herbert J. Bernstein
ADSC has been a leader in supporting compressed CBF's.
=====================================================
Herbert J. Bernstein
Professor of Mathematics and Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
=====================================================
----------
From: William G. Scott
The mp3/music analogy might be quite appropriate.
On some commercial music download sites, there are several options for purchase, ranging from audiophool-grade 24-bit, 192kHz sampled music, to CD-quality (16-bit, 44.1kHz), to mp3 compression and various lossy bit-rates. I am told that the resampling and compression is actually done on the fly by the server, from a single master, and the purchaser chooses what files to download based on cost, ability to play high-res data, degree of canine-like hearing, intolerance for lossy compression with its limited dynamic range, etc.
Perhaps that would be the best way to handle it from a central repository, allowing the end-user to decide on the fly. The lossless files could somehow be tagged as such, to avoid confusion.
Bill
William G. Scott
Professor
Department of Chemistry and Biochemistry
USA