From: James Holton 
Date: 7 November 2011 17:30
At the risk of sounding like another "poll", I have a pragmatic question for the methods development community:  
 Hypothetically, assume that there was a website where you could download the original diffraction images corresponding to any given PDB file, including "early" datasets that were from the same project, but because of smeary spots or whatever, couldn't be solved.  There might even be datasets with "unknown" PDB IDs because that particular project never did work out, or because the relevant protein sequence has been lost.  Remember, few of these datasets will be less than 5 years old if we try to allow enough time for the original data collector to either solve it or graduate (and then cease to care).  Even for the "final" dataset, there will be a delay, since the half-life between data collection and coordinate deposition in the PDB is still ~20 months.  Plenty of time to forget.  So, although the images were archived (probably named "test" and in a directory called "john") it may be that the only way to figure out which PDB ID is the "right answer" is by processing them and comparing to all deposited Fs.  Assume this was done.  But there will always be some datasets that don't match any PDB.  Are those interesting?  What about ones that can't be processed?  What about ones that can't even be indexed?  There may be a lot of those!  (hypothetically, of course).  
 Anyway, assume that someone did go through all the trouble to make these datasets "available" for download, just in case they are interesting, and annotated them as much as possible.  There will be about 20 datasets for any given PDB ID.  
 Now assume that for each of these datasets this hypothetical website has two links, one for the "raw data", which will average ~2 GB per wedge (after gzip compression, taking at least ~45 min to download), and a second link for a "lossy compressed" version, which is only ~100 MB/wedge (2 min download).  When decompressed, the images will visually look pretty much like the originals, and generally give you very similar Rmerge, Rcryst, Rfree, I/sigma, anomalous differences, and all other statistics when processed with contemporary software.  Perhaps a bit worse.  Essentially, lossy compression is equivalent to adding noise to the images.  
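 For anyone who wants to check the "equivalent to adding noise" claim on their own frames, the effective noise is easy to quantify: compare each pixel before and after a compress/decompress round trip and see how the residual stacks up against ordinary photon-counting noise.  A rough Python sketch (the frame reader and file names are placeholders, not part of any real package):

import numpy as np

def added_noise_report(original, roundtrip):
    """Compare a frame before and after lossy compression.

    original, roundtrip: 2-D integer arrays of pixel counts
    (how you read them depends on your detector format).
    """
    diff = roundtrip.astype(np.float64) - original.astype(np.float64)
    rms_added = np.sqrt(np.mean(diff ** 2))
    # photon-counting (Poisson) noise is ~sqrt(counts) per pixel
    rms_poisson = np.sqrt(np.mean(np.maximum(original, 1)))
    print(f"RMS error from compression : {rms_added:.2f} counts/pixel")
    print(f"typical Poisson noise      : {rms_poisson:.2f} counts/pixel")
    print(f"ratio (want << 1)          : {rms_added / rms_poisson:.3f}")

# hypothetical usage, with two frames already loaded as numpy arrays:
# added_noise_report(read_frame("original.img"), read_frame("lossy.img"))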
 Which one would you try first?  Does lossy compression make it easier to hunt for "interesting" datasets?  Or is it just too repugnant to have "modified" the data in any way shape or form ... after the detector manufacturer's software has "corrected" it?  Would it suffice to simply supply a couple of "example" images for download instead?
  
 -James Holton
 MAD Scientist
 ----------
From: Herbert J. Bernstein 
This is a very good question.  I would suggest that both versions
  of the old data are useful.  If what is being done is simple validation
 and regeneration of what was done before, then the lossy compression
 should be fine in most instances.  However, when what is being
 done hinges on the really fine details -- looking for lost faint
 spots just peeking out from the background, looking at detailed
 peak profiles -- then the lossless compression version is the
 better choice.  The annotation for both sets should be the same.
 The difference is in storage and network bandwidth. 
 Hopefully the fraud issue will never again rear its ugly head,
 but if it should, then having saved the losslessly compressed
 images might prove to have been a good idea. 
 To facilitate experimentation with the idea, if there is agreement
 on the particular lossy compression to be used, I would be happy
 to add it as an option in CBFlib.  Right now all the compressions
 we have are lossless. 
 Regards,
   Herbert  
 =====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769
 =====================================================
----------
From: James Holton
So far, all I really have is a "proof of concept" compression algorithm here:  
http://bl831.als.lbl.gov/~jamesh/lossy_compression/  Not exactly "portable" since you need ffmpeg and the x264 libraries
 set up properly.  The latter seems to be constantly changing things
 and breaking the former, so I'm not sure how "future proof" my
 "algorithm" is. 
 Something that caught my eye recently was fractal compression,
 particularly since FIASCO has been part of the NetPBM package for
 about 10 years now.  Seems to give comparable compression vs quality
 as x264 (to my eye), but I'm presently wondering if I'd be wasting my
 time developing this further?  Will the crystallographic world simply
 turn up its collective nose at lossy images?  Even if it means waiting
 6 years for "Nielsen's Law" to make up the difference in network
 bandwidth? 
 -James Holton
 MAD Scientist
 ----------
From: Herbert J. Bernstein 
Dear James, 
   You are _not_ wasting your time.  Even if the lossy compression ends
 up only being used to stage preliminary images forward on the net while
 full images slowly work their way forward, having such a compression
 that preserves the crystallography in the image will be an important
 contribution to efficient workflows.  Personally I suspect that
 such images will have more important uses, e.g. facilitating
 real-time monitoring of experiments using detectors providing
 full images at data rates that simply cannot be handled without
 major compression.  We are already in that world.  The reason that
 the Dectris images use Andy Hammersley's byte-offset compression,
 rather than going uncompressed or using CCP4 compression, is that
 in January 2007 we were sitting right on the edge of a nasty CPU-performance/disk bandwidth tradeoff, and the byte-offset
 compression won the competition.   In that round a lossless
 compression was sufficient, but just barely.  In the future,
 I am certain some amount of lossy compression will be
 needed to sample the dataflow while the losslessly compressed
 images work their way through a very back-logged queue to the disk. 
   In the longer term, I can see people working with lossy compressed
 images for analysis of massive volumes of images to select the
 1% to 10% that will be useful in a final analysis, and may need
 to be used in a lossless mode.  If you can reject 90% of the images
 with a fraction of the effort needed to work with the resulting
 10% of good images, you have made a good decision. 
   And then there is the inevitable need to work with images on
 portable devices with limited storage over cell and WIFI networks. ... 
   I would not worry about upturned noses.  I would worry about
 the engineering needed to manage experiments.  Lossy compression
 can be an important part of that engineering. 
   Regards,
     Herbert
 -- 
      Dowling College, Brookhaven Campus, B111B
    1300 William Floyd Parkway, Shirley, NY, 11967
 =====================================================
----------
From: Frank von Delft
I'll second that...  can't remember anybody on the barricades about "corrected" CCD images, but they've been just so much more practical.  
 Different kind of problem, I know, but equivalent situation:  the people to ask are not the purists, but the ones struggling with the huge volumes of data.  I'll take the lossy version any day if it speeds up real-time evaluation of data quality, helps me browse my datasets, and allows me to do remote but intelligent data collection.  
 phx.
----------
From: Miguel Ortiz Lombardia
So the purists of speed seem to be more relevant than the purists of images.  
 We complain all the time about how many errors we have out there in our
 experiments that we seemingly cannot account for. Yet, would we add
 another source? 
 Sorry if I'm missing something serious here, but I cannot understand
 this artificial debate. You can do useful remote data collection without
 having to look at *each* image.  
 Miguel  
--
 Miguel
 ----------
From: Jan Dohnalek
I think that truly universal image deposition will not take off without a new type of compression that will speed things up and make them easier.
 Therefore the compression discussion is highly relevant - I would even suggest going to mathematicians and software engineers to provide
 a highly efficient compression format for our type of data - our data sets have some very typical repetitive features, so they can very likely be compressed as a whole set without losing information (differential compression in the series), but this needs experts ..  
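 As a toy illustration of the differential idea: subtract the previous frame in the series and let a general-purpose compressor squeeze the mostly-small residuals, which stays lossless.  A sketch only - real frames would need care with bit depth, rotation-dependent changes and so on, and the loading of frames into arrays is left out:

import zlib
import numpy as np

def compress_series(frames):
    """Lossless inter-frame differential compression (toy sketch).

    frames: list of 2-D int32 numpy arrays from one sweep.
    The first frame is stored as-is, then only frame-to-frame
    differences, which are mostly small numbers and compress well.
    """
    chunks = []
    previous = np.zeros_like(frames[0])
    for frame in frames:
        residual = (frame - previous).astype(np.int32)
        chunks.append(zlib.compress(residual.tobytes(), level=6))
        previous = frame
    return chunks

def decompress_series(chunks, shape):
    """Exact inverse: accumulate the stored differences."""
    previous = np.zeros(shape, dtype=np.int32)
    frames = []
    for chunk in chunks:
        residual = np.frombuffer(zlib.decompress(chunk),
                                 dtype=np.int32).reshape(shape)
        previous = previous + residual
        frames.append(previous)
    return frames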
Jan Dohnalek
-- 
Jan Dohnalek, Ph.D
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
 16206 Praha 6
Czech Republic
 ----------
From: Graeme Winter
Hi James, 
 Regarding the suggestion of lossy compression, it is really hard to
 comment without having a good idea of the real cost of doing this. So,
 I have a suggestion: 
  - grab a bag of JCSG data sets, which we know should all be essentially OK.
  - squash then unsquash them with your macguffin, perhaps making A/B
 pairs and randomizing which copy gets squashed.
  - process them with Elves / xia2 / autoPROC (something which is reproducible)
  - pop the results into pdb_redo 
 Then compare what comes out. Ultimately, adding "noise" may (or may
 not) make a measurable difference to the final refinement - this would
 be a way of telling whether it does or doesn't. Why, however, would I
 have any reason to worry? Because the noise being added is not really
 random - it will be compression artifacts. These could have a subtle
 effect on how the errors are estimated and so on. However, you can hum
 and haw about this for a decade without reaching a conclusion. 
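 Roughly what I have in mind for the blinding step, as a sketch only - "squash" and "unsquash" stand in for whatever James's tool ends up being called, and the paths are made up.  The key file lets someone unblind the comparison only after both copies have been through the processing pipeline and pdb_redo:

import random
import shutil
import subprocess
from pathlib import Path

def prepare_blinded_pair(dataset_dir, work_dir):
    """Make copies A and B of a dataset and lossily round-trip one of
    them at random; record which one so the comparison can be unblinded
    later.  "squash"/"unsquash" are placeholders for the real tool."""
    dataset_dir, work_dir = Path(dataset_dir), Path(work_dir)
    squashed = random.choice(["A", "B"])
    for label in ("A", "B"):
        copy_dir = work_dir / dataset_dir.name / label
        shutil.copytree(dataset_dir, copy_dir)
        if label == squashed:
            subprocess.run(["squash", str(copy_dir)], check=True)
            subprocess.run(["unsquash", str(copy_dir)], check=True)
    key_file = work_dir / dataset_dir.name / "squashed_copy.txt"
    key_file.write_text(squashed + "\n")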
 Here, it's something which in all honesty we can actually evaluate, so
 is it worth giving it a go? If the results were / are persuasive (i.e.
 a "report on the use of lossy compression in transmission and storage
 of X-ray diffraction data" was actually read and endorsed by the
 community) this would make it much more worthwhile for consideration
 for inclusion in e.g. cbflib. 
 I would however always encourage (if possible) that the original raw
 data be kept somewhere on disk in an unmodified form - I am not a fan
 of one-way computational processes with unique data. 
 Thoughts anyone? 
 Cheerio, 
 Graeme 
----------
From: Kay Diederichs 
Hi James, 
 I see no real need for lossy compression datasets. They may be useful for demonstration purposes, and to follow synchrotron data collection remotely. But for processing I need the real data. It is my experience that structure solution, at least in the difficult cases, depends on squeezing out every bit of scattering information from the data, as much as is possible with the given software. Using a lossy-compression dataset in this situation would give me the feeling "if structure solution does not work out, I'll have to re-do everything with the original data" - and that would be double work. Better not start going down that route.  
 The CBF byte-offset compression puts even a 20-bit detector pixel into a single byte, on average. These frames can be further compressed, in the case of Pilatus fine-slicing frames, using bzip2, almost down to the level of entropy in the data (since there are so many zero pixels). And that would be lossless.  
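 The "single byte per pixel on average" comes from the delta coding at the heart of the byte-offset scheme: neighbouring background pixels differ by only a few counts, so one signed byte usually suffices, with escape codes for the occasional Bragg peak.  A simplified sketch follows - the authoritative definition, including a further 8-byte escape level, is in CBFlib:

import struct

def byte_offset_compress(pixels):
    """Simplified sketch of CBF-style byte-offset compression."""
    out = bytearray()
    last = 0
    for value in pixels:
        delta = value - last
        last = value
        if -127 <= delta <= 127:
            out += struct.pack('<b', delta)      # 1-byte delta
        elif -32767 <= delta <= 32767:
            out += struct.pack('<b', -128)       # escape to 2 bytes
            out += struct.pack('<h', delta)
        else:
            out += struct.pack('<b', -128)       # escape to 4 bytes
            out += struct.pack('<h', -32768)
            out += struct.pack('<i', delta)
    return bytes(out)

# a flat background with a couple of Bragg peaks packs to ~1 byte/pixel:
frame = [10, 11, 9, 12, 10, 4000, 3950, 11, 10, 9]
print(len(byte_offset_compress(frame)), "bytes for", len(frame), "pixels")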
 Storing lossily-compressed datasets would of course not double the diskspace needed, but would significantly raise the administrative burdens. 
 Just to point out my standpoint in this whole discussion about storage of raw data:
 I've been storing our synchrotron datasets on disks, since 1999. The amount of money we spend per year for this purpose is constant (less than 1000€). This is possible because the price of a GB disk space drops faster than the amount of data per synchrotron trip rises. So if the current storage is full (about every 3 years), we set up a bigger RAID (plus a backup RAID); the old data, after copying over, always consumes only a fraction of the space on the new RAID.  
 So I think the storage cost is actually not the real issue - rather, the real issue has a strong psychological component. People a) may not realize that the software they use is constantly being improved, and that needs data which cover all the corner cases; b) often do not wish to give away something because they feel it might help their competitors, or expose their faults.  
 best, 
 Kay (XDS co-developer)    
----------
From: Harry Powell 
Hi
 I agree. 
 Harry
 --
 Dr Harry Powell, MRC Laboratory of Molecular Biology, MRC Centre, Hills Road, Cambridge, CB2 0QH
 
 http://www.iucr.org/resources/commissions/crystallographic-computing/schools/mieres2011
 ----------
From: Miguel Ortiz Lombardía 
On 08/11/11 10:15, Kay Diederichs wrote: 
Hi Kay and others, 
 I completely agree with you. 
 Datalove, <3
 :-) 
----------
From: Herbert J. Bernstein
Um, but isn't Crystallography based on a series of
  one-way computational processes:
      photons -> images
      images -> {structure factors, symmetry}
  {structure factors, symmetry, chemistry} -> solution
  {structure factors, symmetry, chemistry, solution}
       -> refined solution 
 At each stage we tolerate a certain amount of noise
 in "going backwards".  Certainly it is desirable to
 have the "original data" to be able to go forwards,
 but until the arrival of pixel array detectors, we
 were very far from having the true original data,
 and even pixel array detectors don't capture every
 single photon. 
 I am not recommending lossy compressed images as
 a perfect replacement for lossless compressed images,
 any more than I would recommend structure factors
 are a replacement for images.  It would be nice
 if we all had large budgets, huge storage capacity
 and high network speeds and if somebody would repeal
 the speed of light and other physical constraints, so that
 engineering compromises were never necessary, but as
 James has noted, accepting such engineering compromises
 has been of great value to our colleagues who work
 with the massive image streams of the entertainment
 industry.  Without lossy compression, we would not
 have the _higher_ image quality we now enjoy in the
 less-than-perfectly-faithful HDTV world that has replaced
 the highly faithful, but lower capacity, NTSC/PAL world. 
 Please, in this, let us not allow the perfect to be
 the enemy of the good.  James is proposing something
 good.

 =====================================================
               Herbert J. Bernstein
     Professor of Mathematics and Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769
 =====================================================
 
----------
From: Graeme Winter
Dear Herbert, 
 Sorry, the point I was getting at was that the process is one way, but
 if it is also *destructive* i.e. the original "master" is not
 available then I would not be happy. If the master copy of what was
 actually recorded is available from a tape someplace perhaps not all
 that quickly then to my mind that's fine. 
 When we go from images to intensities, the images still exist. And by
 and large the intensities are useful enough that you don't go back to
 the images again. This is worth investigating I believe, which is why
 I made that proposal. 
 Mostly I listen to mp3's as they're convenient, but I still buy CD's
 not direct off e.g. iTunes, and yes an H.264-compressed video stream is
 much nicer to watch than VHS. 
 Best wishes, 
 Graeme 
----------
From: James Holton 
At the risk of putting this thread back on-topic, my original question was not "should I just lossfully compress my images and throw away the originals".  My question was:  
  "would you download the compressed images first?" 
 So far, no one has really answered it. 
 I think it is obvious that of course we would RATHER have the original data, but if access to the original data is "slow" (by a factor of 30 at best) then can the "mp3 version" of diffraction data play a useful role in YOUR work?  
 Taking Graeme's request from a different thread as an example, he would like to see stuff in P21 with a 90 degree beta angle.  There are currently ~609 examples of this in the PDB.  So, I ask again: "which one would you download first?".  1aip? (It is first alphabetically).  Then again, if you just email the corresponding authors of all 609 papers, the response rate alone might whittle the number of datasets to deal with down to less than 10.  Perhaps even less than 1.
  
 -James Holton
 MAD Scientist
----------
From: Graeme Winter 
Hi James, 
 Fair enough. 
 However I would still be quite interested to see how different the
 results are from the originals and the compressed versions. If the
 differences were pretty minor (i.e. not really noticeable) then I
 would certainly have a good look at the mp3 version. 
 Also it would make my data storage situation a little easier, at least
 what I use for routine testing. Worth a go? 
 Cheerio, 
 Graeme 
----------
From: Miguel Ortiz Lombardia 
On 08/11/2011 19:19, James Holton wrote:
 Hmm, I thought I had been clear. I will try to be more direct: 
 Given the option, I would *only* download the original,
 non-lossy-compressed data. At the expense of time, yes. I don't think
 Graeme's example is very representative of our work, sorry. 
 As long as both options are guaranteed to remain available, I don't care. I just
 don't see the point, for the very same reasons Kay has very clearly laid out. 
 Best regards, 
----------
From: <mjvdwoerd@netscape.net>
   Hmmm, so you would, when collecting large data images, say 4 images, 100MB in size, per second, in the middle of the night, from home, reject seeing compressed images on your data collection software, while the "real thing" is lingering behind somewhere, to be downloaded and stored later? As opposed to not seeing the images (because your home internet access cannot keep up) and only inspecting 1 in 100 images to see progress?
  
 I think there are instances where compressed (lossy or not) images will be invaluable. I know the above situation was not the context, but (y'all may gasp about this) I still have some friends (in the US) who live so far out in the wilderness that only dial-up internet is available. Meanwhile, synchrotrons and the detectors used get better all the time, which means more MB/s produced. 
  
 James has already said (and I agree) that the original images (with all information) should not necessarily be thrown away. Perhaps a better question would be "which would you use for what purpose", since I am convinced that compressed images are useful. 
  
 I would want to process the "real thing", unless I have been shown by scientific evidence that the compressed thing works equally well. It seems reasonable to assume that such evidence can be acquired and/or that we can be shown by evidence what we gain and lose by lossy-compressed images. Key might be to be able to choose the best thing for your particular application/case/location etc. 
  
 So yes, James, of course this is useful and not a waste of time.
 
 Mark
  
----------
From: Miguel Ortiz Lombardia
On 08/11/2011 20:46, mjvdwoerd@netscape.net wrote:
  1. I don't need to *see* all the images to verify whether the collection is
 going all right. If I collect remotely, I process remotely; there is no need
 to transfer images. Data are collected so fast today that you may, even
 while collecting at the synchrotron, finish the collection without a)
 actually seeing all the images (cf. Pilatus detectors) or b) keeping pace
 at all with your data processing. The crystal died or was not collected
 properly? You try to understand why, then you recollect it if possible or
 try a new crystal. It has always been like this; it's called trial and error. 
 2. The ESRF in Grenoble produces thumbnails of the images. If all you
 want to see is whether there is diffraction, they are good enough and
 they are useful. They are extremely lossy and useless for anything else. 
 3. Please, compare contemporary facts. Today's bandwidth is what it is,
 today's images are *not* 100 MB (yet). When they get there, let us know
 what the bandwidth is.
 I would understand a situation like the one you describe for a poor, or
 an embargoed country where unfortunately there is no other way to
 connect to a synchrotron. Still, that should be solved by the community
 in a different way: by gracious cooperation with our colleagues in those
 countries. Your example is actually quite upsetting, given the current
 state of affairs in the world.
 I think I was clear: as long as we have access to the original data, I
 don't care. I would only use the original data.
 This still assumes that future software will not be able to detect the
 differences that you cannot see today. This may or may not be true, the
 consequences may or may not be important. But there is, I think,
 reasonable doubt on both questions.
 I have said to James, off the list, that he should go on if he's
 convinced about the usefulness of his approach. For a very scientific
 reason: I could be wrong. Yet, if we do need to go down the compression
 path, I think we should prefer lossless options. 
----------
From: Phil Evans 
It would be a good start to get all images written now with lossless compression, instead of the uncompressed images we still get from the ADSC detectors - something we have been promised for many years.  
 Phil
 ----------
From: Herbert J. Bernstein 
ADSC has been a leader in supporting compressed CBF's.
 
 =====================================================
               Herbert J. Bernstein
     Professor of Mathematics and Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769
 =====================================================
 
----------
From: William G. Scott 
The mp3/music analogy might be quite appropriate.  
 On some commercial music download sites, there are several options for purchase, ranging from audiophool-grade 24-bit, 192kHz sampled music, to CD-quality (16-bit, 44.1kHz), to mp3 compression and various lossy bit-rates.  I am told that the resampling and compression is actually done on the fly by the server, from a single master, and the purchaser chooses what files to download based on cost, ability to play high-res data, degree of canine-like hearing, intolerance for lossy compression with its limited dynamic range, etc.  
 Perhaps that would be the best way to handle it from a central repository, allowing the end-user to decide on the fly. The lossless files could somehow be tagged as such, to avoid confusion.  
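 A minimal sketch of how a repository might do that, with made-up paths and "squash" again standing in for whatever codec gets adopted - serve the untouched master for "lossless" requests, and generate (and cache) lossy renditions only on first request:

import subprocess
from pathlib import Path

ARCHIVE = Path("/archive/raw")        # hypothetical repository layout
CACHE   = Path("/archive/derived")

def fetch(dataset_id, quality="lossless"):
    """Return the file to serve: the untouched master for "lossless",
    otherwise a lossy rendition made on first request and cached.
    "squash" is a placeholder for the repository's chosen codec."""
    master = ARCHIVE / dataset_id / "master.tar"
    if quality == "lossless":
        return master                 # tagged as the real thing
    derived = CACHE / dataset_id / f"{quality}.tar"
    if not derived.exists():
        derived.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["squash", "--quality", quality,
                        str(master), "-o", str(derived)], check=True)
    return derived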
 Bill    
 William G. Scott
 Professor
 Department of Chemistry and Biochemistry
 USA