Date: 24 January 2012 04:46
Hello Everyone,
I want to play around with some coding/programming. Just simple calculations from an input PDB file, B factors averages, occupancies, molecular weight, so forth...
What should I use python,C++, visual basic?
thanks
----------
From: Ethan Merritt
What you describe is primarily a task of processing the text in a PDB file.
I would recommend perl, with python as a more trendy alternative.
If this is to be a springboard for a larger project, then you might choose
instead to use a standard library like cctbx to do the fiddly stuff and
call it from a higher level language (C or C++).
Ethan
----------
From: Pavel Afonine
Hi Yuri,
----------
From: Jens Kaiser
I second everythig Ethan Merritt said, but would add: awk is easier and as functional as Perl for "quick and dirty" projects. Once you need Perl's complexity, you're probably better off moving to Python or a compiled language; Perl is powerful, but it allows you to do really dirty coding; I found myself writing an "elegant" Perl script that I did not understand anymore 1/2y later.
I would also add Fortran (maybe Fortran90) to the list of higher level languages.
Disclaimer: I'm not a programmer, I "hack" things together...
Cheers,
Jens
----------
From: Bart Hazes
I have used a number of languages and have found only one I really disliked, that being perl. It is hard for me to imagine that this language was developed by a linguist yet in my eyes it is the least natural language from human comprehension point of view. In contrast, python is much more intuitive and should be very suitable for the tasks you describe.
my 2p
Bart
----------
From: James Stroud
print "Hello World"
class HelloWorldApp {public static void main(String[] args) {System.out.println("Hello World!"); // Display the string.}}
----------
From: Graeme Winter
OK, feel like I need to comment on this one. In terms of general
programming you could use whatever you like, perl (if as was said
above you like write-only programs) tcl, python, c++ etc.
However if you would like to do crystallographic calculations, I can
recommend that Python + CCTBX is excellent. There's an enormous amount
you can do with really surprisingly short python programs.
There's also a CCTBX mailing list where you can get lots of help with things.
Cheerio,
Graeme
----------
From: Tim Gruene
Dear Yuri,
if there is somebody near your office who knows the language XYZ pretty
well, go ahead and use that language. It's probably worth a lot more
having somebody to discuss your code with than to try and find a
consensus from people's opinion about which language is best.
Cheers,
Tim
[flame=;-)]
P.S.: don't use python. It's a nightmare language, sloppy, it forces you
to format the code in a specific way rather than your own way and ...
[/flame]
- --
- --
Dr Tim Gruene
----------
From: Phil Evans
If you know C++, I've found Kevin Cowtan's Clipper libraries very useful for doing little analyses involving coordinate superposition, etc. I'm not sure I would recommend learning C++ just for this purpose (though it's a good language to know). Clipper is distributed in CCP4
Phil
----------
From: Pascal
Hi,
Le lundi 23 janvier 2012 23:29:39 vous avez écrit :
Python is certainly a nice idea, especially with cctx. But it's not perfect as
it's not a compiled language. It really depends on the task you have to
acomplish.
To play around, it's an excellent language.
Fortran come from the 60's and is not evil. depends on what you want do. There
are also quite some code already in Fortran. But so not much in fortran 90.
Fortran is also not too difficult. For example, if you stay away from pointers,
there is no memory leak possible by design.
Python is fast if you don't use python that's true :D
A good start is not to use loops or at least as few as possible. loops are
horribly slow. Instead you need to rely heavily on numpy, it's a good way to
remember your linear algebra :)
Low level languages have their advantages. It's not a good idea to stay away
from them just because there is more work or it's more difficult.
You cannot have 50M reflections*atoms*s^-1 processed for structure factor
calculations in python. But it's true you probably won't need it every day.
Brute force has some advantages over cool ideas sometimes ;)
Pascal
----------
From: Paul Emsley
Yuri,
My 2c...
It depends what you want. If you want to gain experience programming then go ahead, write a PDB parser in whatever language you like: awk, python, perl, C++... (I'd suggest nothing proprietary).
If you want to produce programs that will work for others and in 5 years from now - when the PDB format is burnt and buried - that will be useless of course. I'd suggest (as others) using cctbx or mmdb (which is what the professionals use).
Or, dare I add, you could use the coordinate interface exposed in Coot and write a plug-in.
Musing: I have been toying with the idea of adding boost to Coot dependencies. Boost has grown on me. I think that it would be useful to have a Boost.Python-based wrappers to mmdb and Clipper.
Paul.
----------
From: David Schuller
An Empirical Comparison of the Accuracy Rates of Novices using the Quorum, Perl, and Randomo Programming Languages
A. Stefik, S. Siebert, M. Stefik, K. Slattery
Abstract: "... Perl users were unable to write programs more accurately than those using a language designed by chance."
-- ======================================================================= All Things Serve the Beam ======================================================================= David J. Schuller modern man in a post-modern world MacCHESS, Cornell University
From: Frank von Delft
----------
From: Debreczeni, Judit
Another vote for python+cctbx.
And a bit of fun:
http://comicjk.com/comic.php/217
http://comicjk.com/comic.php/823
http://comicjk.com/comic.php/842
http://comicjk.com/comic.php/849
----------
From: Ian Tickle
... and from the same paper: "Students tell us that the syntax they
are learning (in C++ at our school), makes no sense and some have
difficulty writing even basic computer programs.".
Maybe the difficulty the students in the test group experienced had a
lot to do with the fact they were complete novices and had no
experience whatsoever of any computer language; also the language
syntax was not explained to them prior to the test, they had to deduce
it during the test time itself (21 mins) from code samples.
Note that Python wasn't tested, so the fact that they found Perl
syntax difficult to deduce from the samples doesn't necessarily imply
that they wouldn't also have had the same difficulty with Python. One
of the difficulties reported with Perl was the use of the 'for'
keyword to introduce a loop (novices apparently report that 'repeat'
is 7 times more intuitive than 'for'). But Python and Perl (and C/C++
of course) both use the 'for' keyword in a loop construct, and maybe
all this paper proves is that 'repeat' is easier to comprehend than
'for' (or any other loop syntax)! I remember when I first learned
Fortran the 'DO' loop was the hardest concept to grasp (not so much
the syntax, but the concept of looping itself, with variables
potentially having different values on each pass through the loop):
this was followed closely by the FORMAT statement! I think every
programming novice finds the concept of looping difficult at first in
whatever language they are using: you can often recognise novice code
because it studiously avoids the use of loops!
Personally I think there's plenty of opportunity to write bad (and
good) code in any language; for me the choice of language is a
personal one and not one I lose any sleep over. Far more important to
me than writing beautiful code is getting the algorithm right,
particularly as it affects numerical precision. Debugging the syntax
is the easy bit, debugging the algorithm is always the hard part.
Cheers
-- Ian
----------
From: James Stroud
----------
From: Bart Hazes
I agree with all of the above but it does not help someone who is asking about what language(s) to look into for a first foray into programming. From the original description Yuri is one of the many people who want to learn computer programming in general and apply it, probably on a rather infrequent basis, to relatively straightforward tasks that don't rely on speed, integration with the internet, special libraries etc. In such cases I think python should be at, or near, the top of the list of programming languages to look at and perl near the bottom. That doesn't mean python is not suitable for much more demanding tasks but if you are going to program on a daily basis or start on a project with special needs you really should study the strengths and weaknesses of each language in more detail.
Bart
PS: I think even novices should not have too much difficulty understanding how the example program below produces the results at the bottom.
languageList = ["python", "java", "...", "perl"]
print "Ranking of some computer languages based on ease of use"
i = 1
for language in languageList:
print i, language
i = i + 1
result
Ranking of some computer languages based on ease of use
1 python
2 java
3 ...
4 perl
----------
From: Ian Tickle
I'm inclined to go along with you there: the main reason I've put off
learning Python is that it uses spacing characters (space, tab and
newline) as an integral part of the syntax and I've always felt
uncomfortable with that (memories of fixed format code and data I
suppose!). Somehow a semicolon at the end of the line has a
reassuring air of finality! Maybe a Python expert will answer this
but I've often wondered, what happens if as some editors do
(particularly if as I do you have to use different editors at
different times depending on where you are working, such as on Windows
working remotely from home or Linux at work), you could have a mixture
of space and tab characters in the file? Often you don't know what
space/tab characters the editor is putting in and you shouldn't have
to know: the principle should be that if it looks right in the editor
then it is right. So does Python automatically expand the tabs to the
equivalent number of spaces or (as in data input) are they treated as
single characters? And anyway how does Python know what tab stops my
editors are set to and indeed how exactly my editors treat tabs? The
appearance of the file doesn't tell you what's going on, e.g. N spaces
followed by a tab looks exactly the same as just a tab if N <8. I
would hope that the answer is that I don't need to worry about any of
this!
Cheers
-- Ian
----------
From: Nat Echols
Yes, this can happen. In practice, one learns very quickly to
configure the text editors to prevent this kind of mix-up, i.e. by
always inserting spaces instead of tab characters. In vim, for
instance, you can do this:
set expandtab
set tabstop=2
For CCTBX, the rule is to use two spaces as the indentation (tabs
strictly forbidden), and despite having multiple active contributors,
this is almost never a problem. (The rule for most other Python
modules appears to be four spaces, which I personally find too wide.)
It becomes second nature after a while, just like adding a semicolon
at the end of each statement in C/C++/Java/etc. I agree that it seems
annoying and confusing at first, but if you've ever tried to edit
someone else's C or Perl code where the indentation was totally
inconsistent, you'll quickly learn to appreciate Python's style.
Answers are "no" and "it doesn't", at least I don't think so. The
safest thing (especially if you need to copy and paste code from any
other module) is to never use literal tabs.
-Nat
----------
From: James Stroud
----------
From: James Holton
When did writing scripts become "off topic" for the CCP4BB!?
Personally, I use awk for most text-processing tasks, and since I have found that ~95% of science is converting information from one file format into another, I tend to use awk a lot.
The "k" in awk stands for Kernighan, one of the authors of "C", so the syntax is very similar. Strange to me how one of the first things you hear about any "great new language" in the last 20 years has been "it has a C-like syntax". Anyway, the main differences between awk and C is that the reading in and parsing of a text file line-by-line is implicit, as are variable types. That is, strings and numbers inter-convert automatically, depending on how you use them. A very short intro to awk (and the only one I have ever read) is here:
http://bl831.als.lbl.gov/~jamesh/pickup/awk.pdf
plus my own personal "tips & tricks" for using awk in crystallography:
http://bl831.als.lbl.gov/~jamesh/elves/manual/tricks.html#awk
and here is an awk program I wrote for taking apart a PDB file and then putting it back together:
http://bl831.als.lbl.gov/~jamesh/pickup/reformatpdb.awk
the last may be a bit "advanced" at first, but if you find yourself regularly doing weird things to PDBs that are not standard features of programs you already have like PDBSET or phenix.pdbtools, then it may be a good place to start.
That said, I should warn you that the "what is the best language" question is a sure-fire way to start a flame war. Pretty much everyone you talk to who has done even a little programming will have a "favorite language" that they will defend as emphatically and energetically as they disparage every other language. The reason for this is a sad truth: learning a language is work, and people tend to protect their investments. This is just as true for computer languages as it is for "regular" languages, and although we have things like "Rosetta Stone" and "Hello World", nothing is ever going to change the fact that communicating complex ideas (be it to a machine or to a human) requires learning a complex language. So, when you find that someone is unusually "pushy" about their favorite language: be afraid. The language may be dying, and they must have invested a tremendous amount of time and effort learning it if they are so desperate for you to join their ranks.
Oh, and don't fall for the "so other people can read your code" trick. Trust me, NOBODY wants to read your code! Unless, of course, they are trying to re-write it in their favorite language.
-James Holton
MAD Scientist
----------
From: Pete Meyer
I don't think this is necessarily the case. If I'm using your code to do something scientifically interesting, then I most definitely DO want to read it. If your code crashes on my data, then I'm going to read it to figure out why (and probably fire up the debugger if necessary).
And even if I don't read your code, the fact that I have the ability to makes me more likely to try it out (in my experience, there's a general trend relating software I can't read with software that has irritating bugs).
That said, I'm probably in the minority of "other people". And I'm probably not going to read the boring parts of a program unless I suspect there's a problem with them.
For my two cents on the original question - if you're using a library, use the language that library's written in. Otherwise, use whatever language you feel most comfortable expressing the solution to your problem in.
Pete
----------
From: Mark Brooks
Structure I/O and Manipulation
How do I read a structure file?
The current implementation only reads PDB format files, there will hopefully be code for parsing mmCIF and PDB XML files in the future. Adding these new formats will probably change the API for reading and writing structure files.
Given the above caveat, the current way to read a PDB file is to slurp the file into a string which is fed to the constructor of the Bio::PDB class:
require 'bio/db/pdb' string = IO.read('pdb1tim.ent') structure = Bio::PDB.new(string) The new Bio::PDB object now contains all the annotations and coordinate data.
How do I write a structure file?
As with reading, the current implementation can only write in PDB format.
All the coordinate classes in the Bio::PDB heirarchy have .to_s methods which return the object in PDB format. This makes output as simple as:
require 'bio/db/pdb' file = File.new('pdb1tim.ent').gets(nil) structure = Bio::PDB.new(file) puts structure.to_s # Writes whole structure puts structure[nil]['A'].to_s # Writes chain A only
How do I access annotations?
The annotations from the PDB file are stored in Bio::PDB::Record objects. You can retrieve a specific annotation by calling the '.record' method of a Bio::PDB object with the name of the annotation as the argument:
structure.record("HEADER")
In fact '.record' returns an array of all Bio::PDB::Records of the given type, so you'll need to call '.first' to get to the actual Bio::PDB::Record. Bio::PDB::Record provides methods to get to the individual data in each record. E.g. The 'HEADER' record provides classification, depDate and idCode methods. You can look at the PDB format to see what fields are available in each record, or just look in the pdb.rb file which has a hash of all the definitions.
So, to get the title of a PDB file you would do it like this:
structure.record('TITLE').first.title
Some records have their own special methods:
structure.remark(num) #Returns hash of the REMARK records based on 'num' structure.jrnl #Returns a hash of the JRNL records structure.jrnl('TITL') #Returns an array of all the TITL subfields from #the JRNL records structure.helix(id) #Returns the HELIX records based on 'id' #Returns an array if no 'id' is given structure.turn #Same interface as '.helix' structure.sheet #Same interface as '.sheet' structure.seqres(id) #Returns the sequence given in the SEQRES record #as a Bio::Sequence::AA object structure.keywords #returns a flattened lsit of keywords
And some methods are included for Bio::DB compatability:
structure.entry_id #Returns idCode structure.accession #Same as entry_id structure.definition #Title structure.version #REVDAT
How do I access the coordinate data?
Coordinate data is stroed in a heirarchy of classes with Bio::PDB at the top. A Bio::PDB object contains one or more Bio::PDB::Model objects which contain one or more Bio::PDB::Chain objects which contain one or more Bio::PDB::Residue objects which contain Bio::PDB::Atom objects.
There are two ways to access the coordinate data in a Bio::PDB object: keyed access and iterators.
Keyed access provides the simplest way to get to data if you know which residues or atoms you're interested in. For instance if you wanted to get hold of chain 'A' you can do it like this:
chain = structure[nil]['A']
The nil refers to the model, which will be nil in crystal structures but may have a number in NMR structures. Every level in the heirarchy can be accessed this way. E.g. to get hold of the CA atom of residue 100 in chain 'A':
structure[nil]['A']['100']['CA']
or if you still have your chain object:
chain['100']['CA']
The residue identifier is not just the residue number. PDB residues can have insertion codes and even negative numbers.
chain['100A']['CA']
could be valid.
Iterators are also provided so you can easily go through all objects in the heirarchy. E.g:
structure.each{ |model| model.each{ |chain| chain.each{ |residue| residue.each{ |atom| puts atom } } } }
Goes through every atom in the structure and outputs it. All the classes in the heirarchy implement the standard Enumerable and Comparable mixins as well.
Each class has a number of accessors to allow access to the data, most is in the Bio::PDB::Atom class:
- Bio::PDB::Model has a 'model_serial' accessor only.
- Bio::PDB::Chain has an 'id' accessor for getting and setting the chain id.
- Bio::PDB::Residue has accessors for resName, resSeq and iCode.
- Bio::PDB::Atom has accessors for serial, element, alt_loc, x, y, z, occ, and bfac
To find the x coordinate of the CA atom of residue 100 is then:
structure[nil]['A']['100']['CA'].x
How do I make a deep copy of a structure object?
You can't! There seem to be problems implementing a deep copy function in BioRuby. This is to be fixed in future versions.
Note to developers: have we tried the time-old hack of Marshal.load(Marshal.dump(obj)) to do the trick?
How do I measure distances and angles?
Methods for measuring distance and dihedral angles are provided in the Utils module. To measure distance, use the Bio::PDB::Utils.distance method:
res1 = structure[nil]['A']['100'] res2 = structure[nil]['A']['101'] distance = Bio::PDB::Utils.distance(res1['CA'].xyz, res2['CA'].xyz)
Bio::PDB::Utils.distance requires two Vectors as its arguments. Bio::PDB::Coordinate objects, which are produced by the .xyz call to the Bio::PDB::Atom object, are Vectors. You can also provide a Vector directly:
distance = Bio::PDB::Utils.distance(res1['CA'].xyz, [10,10,10])
And use the .centreOfGravity and .geometricCentre methods, both of which provide a Bio::PDB::Coordinate:
distance = Bio::PDB::Utils.distance(res1.centreOfGravity, res1.geometricCentre)
All objects in the Bio::PDB heirarchy implement the centreOfGravity and geometricCentre methods.
Dihedral angles are calculated from four Vectors / Bio::PDB::Coordinates:
phi = Bio::PDB::Utils.dihedral_angle(res1['C'].xyz, res2['N'].xyz, res2['CA'].xyz, res2['C'].xyz) psi = Bio::PDB::Utils.dihedral_angle(res2['N'].xyz, res2['CA'].xyz, res2['C'].xyz, res3['N'].xyz)
How do I find specific atoms, residues, chains or models?
If the iterators and keyed access aren't powerful enough for you, the finder method is also provided. All the objects in the Bio::PDB hierarchy implement finder. It requires the class of object you wish to find as the first argument, and then a block which is evaluated for each object. If the block returns true then that object is added to an array of 'found' things.
For example, to find all the atoms within a box:
atoms_in_box = structure.finder(Bio::PDB::Atom){ |atom| atom.x.between?(10,20) && atom.y.between?(20,30) && atom.z.between?(30,40) }
Or, to find all the HIS residues, in chain 'A' can be written like this:
his_residues = structure[nil]['A'].finder(Bio::PDB::Residue){ |res| res.resName == 'HIS' }
or like this:
his_residues = structure.finder(Bio::PDB::Residue){ |res| res.resName == 'HIS' && res.chain.id == 'A' }
Since the results are returned in arrays, it is very simple to do set operations, such as finding the overlap or difference between two selections:
sel1 = structure.finder(Bio::PDB::Atom){ |atom| atom.residue.resName == 'HIS' } sel2 = structure.finder(Bio::PDB::Atom){ |atom| atom.x.between?(10,20) && atom.y.between?(20,30) && atom.z.between?(30,40) } sel3 = sel1 & sel2 sel1 = sel1 - sel3
Selection 3 now contains all the HIS atoms within the box and selection 1 contains all the HIS atoms outside the box.
How do I work with ligand data?
Because some PDB files reuse residue numbers that they already used in the ATOM records in the HETATM records, we have to add an extra flag when we want to access HETATM data. The extra flag is simply adding 'LIGAND' to the residue id. E.g Given the PDB file
ATOM 1 N GLY A 2 -5.308 22.647 33.657 1.00 59.93 ATOM 2 CA GLY A 2 -5.090 23.965 33.005 1.00 53.36 ATOM 3 C GLY A 2 -6.209 24.197 32.021 1.00 52.56 ATOM 4 O GLY A 2 -7.000 25.134 32.176 1.00 55.02 .... HETATM 2097 O8 PHT A 2 -18.122 40.603 18.146 1.00 16.14 HETATM 2098 O9 PHT A 2 -17.348 39.940 20.109 1.00 16.06 HETATM 2099 C10 PHT A 2 -15.622 41.753 16.662 1.00 13.34 HETATM 2100 O11 PHT A 2 -16.077 42.457 15.721 1.00 16.27Asking for
structure[nil]['A']['2']is ambiguous, so if you want the ligand data you must use
structure[nil]['A']['LIGAND2']
There should also be an .each_ligand method in the Bio::PDB::Chain class for iterating through the ligands, but this is not implemented in the current cvs version.
How do I work with solvent?
Solvent is defined in the current parser as any HETATM record with HOH as the residue type. This may not be ideal for some files, but deals with 95% of cases which is good enough for me at the moment!
Solvent residues are stored in a separate Bio::PDB::Chain attached to each Bio::PDB::Model. Currently the only methods available are 'addSolvent' which adds a residue to the solvent chain and 'removeSolvent', which sets the whole solvent chain to nil.
An each_solvent method for Bio::PDB::Model has been implemented in my copy but is not in cvs at the moment.
How do I get the sequence of a chain?
There are two types of sequence in a PDB file: the real sequence of whatever protein has been crystalised, and the sequence observable in the structure.
The first type of sequence is available through the SEQRES record (if this is provided in the PDB file). You can access this through the top-level Bio::PDB object:
structure.seqres['A']
provides the sequence given in the SEQRES record for chain A. The observed sequence is obtained from the Bio::PDB::Chain object. E.g.
structure[nil]['A'].atom_seq
Both methods return a Bio::Sequence::AA object.
----------
From: Miguel Ortiz Lombardía
Hi Yuri,
To add something nobody else has mentioned yet.
If you are interested in statistics and analysis you may find
interesting the R package. There is an excelent module, called Bio3D for
the analysis of protein structure and sequence data. Have a look here:
http://bio3d.pbworks.com/w/page/7824467/FrontPage
I use R for a number of things. The learning curve is not that steep.
But may not be what you're looking for.
Best regards,
--
Miguel
No comments:
Post a Comment