From: Ed Pozharski
Date: 8 November 2011 17:29
Could anyone point me towards instructions on how to get/build
parallelized phaser binary on linux? I searched around but so far found
nothing. The latest updated phaser binary doesn't seem to be
parallelized.
Apologies if this has been resolved before - just point at the relevant
thread, please.
--
"I'd jump in myself, if I weren't so good at whistling."
Julian, King of Lemurs
----------
From: Dr G. Bunkoczi
Hi Ed,
In the CCP4 distribution, OpenMP is not enabled by default, and there
seems to be no easy way to enable it (i.e. by setting a flag at the
configure stage).
On the other hand, you can easily create a separate build for phaser
that is openmp enabled and use phaser from there. To do this, create a
new folder, say "phaser-build", cd into it, and issue the following
commands (this assumes you are using bash):
$ python $CCP4/lib/cctbx/cctbx_sources/cctbx_project/libtbx/configure.py
--repository=$CCP4/src/phaser/source phaser
--build-boost-python-extensions=False --enable-openmp-if-possible=True
$ . ./setpaths.sh ("source ./setpaths.csh" with csh)
$ libtbx.scons (if you have several CPUs, add -jX where X is the number of CPUs you want to use for compilation)
This will build an OpenMP-enabled phaser. You can also try passing
the --static-exe flag (to configure.py), in which case the executable is
static and can be relocated without any headaches. This works with
certain compilers.
Let me know if there are any problems!
BW, Gabor
----------
From: Francois Berenger
Hello,
How much faster is the OpenMP version of Phaser
as the number of cores increases?
In the past I have been quite badly surprised by
the lack of acceleration I got when using OpenMP
with some of my programs... :(
Regards,
F.
----------
From: Nat Echols
Amdahl's law is cruel:
http://en.wikipedia.org/wiki/Amdahl's_law
This is the same reason why GPU acceleration isn't very useful for
most crystallography software.
-Nat
----------
From: Ed Pozharski
See page 3 of this
http://www-structmed.cimr.cam.ac.uk/phaser/ccp4-sw2011.pdf
----------
From: Randy Read
Thanks for pointing out that link. The graph makes the point I was going to mention, i.e. that you notice a big difference in using up to about 4 processors for typical jobs, but after that point the non-parallelisable parts of the code start to dominate and there's less improvement. This is very useful if you have one MR job to run on a typical modern workstation (2-8 cores), but if you have several separate jobs to run then you're better off submitting them simultaneously, each using a subset of the available cores. Of course, that assumes you have enough memory for several simultaneous separate jobs!
Regards,
Randy Read
------
Randy J. Read
Department of Haematology, University of Cambridge
----------
From: Pascal
On Tue, 8 Nov 2011 16:25:22 -0800,
Nat Echols wrote:
You need big parallel jobs and to avoid synchronisations, barriers and
that kind of thing. Using data reduction is much more efficient; it works
very well for structure factor calculations, for example.
You can get to much less than 5% serial code.
I have more problems with L2 cache miss events and memory bandwidth. A
quad core needs 4 times the bandwidth of a single process...
If your code is already a bit greedy, the scale-up is not good.
Pascal
----------
From: Francois Berenger
I never went down to this level of optimization.
Are you using valgrind to detect cache miss events?
After gprof, usually I am done with optimization.
I would prefer to change my algorithm and would be afraid
of introducing optimizations that are architecture-dependent
into my software.
Regards,
F.
----------
From: Pascal
No, I am not sure valgrind can cope correctly with multithreaded applications.
In this particular case my code runs faster on an Intel Q9400@2.67GHz with
800MHz DDR2 than on an Intel Q9505@2.83GHz with 667MHz DDR2. I also get
a nice scale-up on a 4x12-core Opteron machine (each CPU has 2 dual-channel
memory buses) but not on my standard quad core. If I get my hands on an i7-920
equipped with triple-channel DDR3, the program should run much faster despite
the same CPU clock.
For profiling I used perf [1] and oprofile [2] on Linux.
Have a look here for the whole story:
<http://blog.debroglie.net/2011/10/25/cpu-starvation/>
When I spot a bottleneck, my first reaction is to change the algorithm:
caching calculations, more efficient algorithms...
But once I had to do some manual loop tiling. It is kind of a change of
algorithm, as the size of a temporary variable changes as well, but the
number of operations remains the same. The code with loop tiling is ~20%
faster, purely from better use of the CPU cache.
<http://blog.debroglie.net/2011/10/28/loop-tiling/>
[1] http://kernel.org/ (the package name should be perf-util or similar)
[2] http://oprofile.sourceforge.net/
Pascal