Sunday, 12 February 2012

writing scripts-off topic

From: Yuri Pompeu
Date: 24 January 2012 04:46


Hello Everyone,
I want to play around with some coding/programming. Just simple calculations from an input PDB file, B factors averages, occupancies, molecular weight, so forth...
What should I use python,C++, visual basic?
thanks

----------
From: Ethan Merritt


What you describe is primarily a task of processing the text in a PDB file.
I would recommend perl, with python as a more trendy alternative.

If this is to be a springboard for a larger project, then you might choose
instead to use a standard library like cctbx to do the fiddly stuff and
call it from a higher level language (C or C++).

       Ethan

----------
From: Pavel Afonine


Hi Yuri,

for example, you can use cctbx for this. Using cctbx you can do somethings as simple as b-factor statistics (see example below) or as complex as write your own refinement program (phenix.refine can serve as an example).

Example: compute min/max/mean B-factor for all atoms and for CA atoms in chain A:

********
from scitbx.array_family import flex
import iotbx.pdb
import sys

def run(args):
  assert len(args) == 1
  pdb_file_name = args[0]
  pdb_input = iotbx.pdb.input(file_name = pdb_file_name)
  atoms = pdb_input.atoms()
  b_factors = atoms.extract_b()
  print "Min, max, mean b_factor:", flex.min(b_factors), flex.max(b_factors), \
    flex.mean(b_factors)
  pdb_hierarchy = pdb_input.construct_hierarchy()
  selection_ca_atoms = pdb_hierarchy.atom_selection_cache().selection(
    string = "chain A and name CA")
  b_factors_ca = b_factors.select(selection_ca_atoms)
  print "Number of CA atoms in chain A:", selection_ca_atoms.count(True)
  print "Total number of atoms:", selection_ca_atoms.size()
  print "Min, max, mean b_factor for CA atoms in chain A:", \
    flex.min(b_factors_ca), flex.max(b_factors_ca), flex.mean(b_factors_ca)
  
if (__name__ == "__main__"):
  run(args=sys.argv[1:])
********

Save the code enclosed between "********" into a file, say example.py, and then run it with your PDB file like this:

cctbx.python example.py model.pdb

and it will output something like this:

Min, max, mean b_factor: 4.4 44.08 9.82779166667
Number of CA atoms in chain A: 16
Total number of atoms: 240
Min, max, mean b_factor for CA atoms in chain A: 5.14 8.29 6.390625

Pavel

----------
From: Jens Kaiser


Yuri,
  I second everythig Ethan Merritt said, but would add: awk is easier and as functional as Perl for "quick and dirty" projects. Once you need Perl's complexity, you're probably better off moving to Python or a compiled language; Perl is powerful, but it allows you to do really dirty coding; I found myself writing an "elegant" Perl script that I did not understand anymore 1/2y later.
  I would also add Fortran (maybe Fortran90) to the list of higher level languages.
  Disclaimer: I'm not a programmer, I "hack" things together...

Cheers,

Jens




----------
From: Bart Hazes


I have used a number of languages and have found only one I really disliked, that being perl. It is hard for me to imagine that this language was developed by a linguist yet in my eyes it is the least natural language from human comprehension point of view. In contrast, python is much more intuitive and should be very suitable for the tasks you describe.

my 2p

Bart

----------
From: James Stroud


Python is the most practical. Here is a simple python program:

print "Hello World"

Feel the power.

Python can be that simple or can be arbitrarily complex, nuanced, or abstract. You can write entire applications in python or small utilities. If you practice good habits, you will begin building reusable libraries from day one, saving time over the long haul.

STAY AWAY from proprietary nonsense like visual basic and from languages that do not facilitate reusability, like perl or other 1980's era shell languages. You will find yourself porting or abandoning your code, which is not a good use of your time.

I also do not recommend overweight languages like java, which create "programs" that never seem to deploy correctly and take about 5 times more code to create than should be necessary. Here's the java "Hello World":


class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World!"); // Display the string.
    }
}


Public static void main? Don't bother.

And python can be VERY fast for calculations if you use free and popular libraries like numpy and scipy. These librares are wrappers around optimized fortran and C libraries that you will never have to use directly.

I recommend staying away from very low level languages like C or fortran, too. It is good to know these languages, but not so good to use them. Your creativity should go towards implementing cool ideas and should not be squandered on plugging memory leaks. It's better to use high level languages that leverage your time most effectively.

James


----------
From: Graeme Winter


OK, feel like I need to comment on this one. In terms of general
programming you could use whatever you like, perl (if as was said
above you like write-only programs) tcl, python, c++ etc.

However if you would like to do crystallographic calculations, I can
recommend that Python + CCTBX is excellent. There's an enormous amount
you can do with really surprisingly short python programs.

There's also a CCTBX mailing list where you can get lots of help with things.

Cheerio,

Graeme

----------
From: Tim Gruene 

Dear Yuri,

if there is somebody near your office who knows the language XYZ pretty
well, go ahead and use that language. It's probably worth a lot more
having somebody to discuss your code with than to try and find a
consensus from people's opinion about which language is best.

Cheers,
Tim

[flame=;-)]
P.S.: don't use python. It's a nightmare language, sloppy, it forces you
to format the code in a specific way rather than your own way and ...
[/flame]
- --
- --
Dr Tim Gruene


----------
From: Phil Evans

If you know C++, I've found Kevin Cowtan's Clipper libraries very useful for doing little analyses involving coordinate superposition, etc. I'm not sure I would recommend learning C++ just for this purpose (though it's a good language to know). Clipper is distributed in CCP4

Phil

----------
From: Pascal


Hi,

Le lundi 23 janvier 2012 23:29:39 vous avez écrit :
Python is certainly a nice idea, especially with cctx. But it's not perfect as
it's not a compiled language. It really depends on the task you have to
acomplish.

To play around, it's an excellent language.
Fortran come from the 60's and is not evil. depends on what you want do. There
are also quite some code already in Fortran. But so not much in fortran 90.
Fortran is also not too difficult. For example, if you stay away from pointers,
there is no memory leak possible by design.
Python is fast if you don't use python that's true :D
A good start is not to use loops or at least as few as possible. loops are
horribly slow. Instead you need to rely heavily on numpy, it's a good way to
remember your linear algebra :)
Low level languages have their advantages. It's not a good idea to stay away
from them just because there is more work or it's more difficult.

You cannot have 50M reflections*atoms*s^-1 processed for structure factor
calculations in python. But it's true you probably won't need it every day.
Brute force has some advantages over cool ideas sometimes ;)

Pascal

----------
From: Paul Emsley


Yuri,

My 2c...

It depends what you want.  If you want to gain experience programming then go ahead, write a PDB parser in whatever language you like: awk, python, perl, C++... (I'd suggest nothing proprietary).

If you want to produce programs that will work for others and in 5 years from now - when the PDB format is burnt and buried - that will be useless of course.  I'd suggest (as others) using cctbx or mmdb (which is what the professionals use).

Or, dare I add, you could use the coordinate interface exposed in Coot and write a plug-in.

Musing:  I have been toying with the idea of adding boost to Coot dependencies.  Boost has grown on me.  I think that it would be useful to have a Boost.Python-based wrappers to mmdb and Clipper.

Paul.

----------
From: David Schuller


www.cs.siue.edu/~astefik/papers/StefikPlateau2011.pdf

An Empirical Comparison of the Accuracy Rates of Novices using the Quorum, Perl, and Randomo Programming Languages
A. Stefik, S. Siebert, M. Stefik, K. Slattery

Abstract: "... Perl users were unable to write programs more accurately than those using a language designed by chance."
--  ======================================================================= All Things Serve the Beam =======================================================================                                David J. Schuller                                modern man in a post-modern world                                MacCHESS, Cornell University                              
----------
From: Frank von Delft


What an astonishingly low bar....

----------
From: Debreczeni, Judit



----------
From: Ian Tickle

... and from the same paper: "Students tell us that the syntax they
are learning (in C++ at our school), makes no sense and some have
difficulty writing even basic computer programs.".

Maybe the difficulty the students in the test group experienced had a
lot to do with the fact they were complete novices and had no
experience whatsoever of any computer language; also the language
syntax was not explained to them prior to the test, they had to deduce
it during the test time itself (21 mins) from code samples.

Note that Python wasn't tested, so the fact that they found Perl
syntax difficult to deduce from the samples doesn't necessarily imply
that they wouldn't also have had the same difficulty with Python.  One
of the difficulties reported with Perl was the use of the 'for'
keyword to introduce a loop (novices apparently report that 'repeat'
is 7 times more intuitive than 'for').  But Python and Perl (and C/C++
of course) both use the 'for' keyword in a loop construct, and maybe
all this paper proves is that 'repeat' is easier to comprehend than
'for' (or any other loop syntax)!  I remember when I first learned
Fortran the 'DO' loop was the hardest concept to grasp (not so much
the syntax, but the concept of looping itself, with variables
potentially having different values on each pass through the loop):
this was followed closely by the FORMAT statement!  I think every
programming novice finds the concept of looping difficult at first in
whatever language they are using: you can often recognise novice code
because it studiously avoids the use of loops!

Personally I think there's plenty of opportunity to write bad (and
good) code in any language; for me the choice of language is a
personal one and not one I lose any sleep over.  Far more important to
me than writing beautiful code is getting the algorithm right,
particularly as it affects numerical precision.  Debugging the syntax
is the easy bit, debugging the algorithm is always the hard part.

Cheers

-- Ian

----------
From: James Stroud

I should know better than to touch a flame post, but, but here goes:

<flame>
don't use anything but python. They are nightmare languages, sloppy, and force you to format the code by inserting semantically redundant brackets and semicolons in a specific way rather than dispensing these redundancies altogether
</flame>
Also, if you use python and you want to format your code in your own way, you should learn about python's "I don't care if anyone, including me, can read my code" parenthetical construct:


print "\n".join((lambda Ru,Ro,Iu,Io,IM,Sx,Sy:reduce(lambda x,y:x+y,map(lambda y,
Iu=Iu,Io=Io,Ru=Ru,Ro=Ro,Sy=Sy,L=lambda yc,Iu=Iu,Io=Io,Ru=Ru,Ro=Ro,i=IM,
Sx=Sx,Sy=Sy:reduce(lambda x,y:x+y,map(lambda x,xc=Ru,yc=yc,Ru=Ru,Ro=Ro,
i=i,Sx=Sx,F=lambda xc,yc,x,y,k,f=lambda xc,yc,x,y,k,f:(k<=0)or (x*x+y*y
>=4.0) or 1+f(xc,yc,x*x-y*y+xc,2.0*x*y+yc,k-1,f):f(xc,yc,x,y,k,f):chr(
64+F(Ru+x*(Ro-Ru)/Sx,yc,0,0,i)),range(Sx))):L(Iu+y*(Io-Iu)/Sy),range(Sy
))))(-2.1, 0.7, -1.2, 1.2, 30, 60, 24)[i*60:(i+1)*60] for i in xrange(24))



James


----------
From: Bart Hazes


I agree with all of the above but it does not help someone who is asking about what language(s) to look into for a first foray into programming. From the original description Yuri is one of the many people who want to learn computer programming in general and apply it, probably on a rather infrequent basis, to relatively straightforward tasks that don't rely on speed, integration with the internet, special libraries etc. In such cases I think python should be at, or near, the top of the list of programming languages to look at and perl near the bottom. That doesn't mean python is not suitable for much more demanding tasks but if you are going to program on a daily basis or start on a project with special needs you really should study the strengths and weaknesses of each language in more detail.

Bart


PS: I think even novices should not have too much difficulty understanding how the example program below produces the results at the bottom.

languageList = ["python", "java", "...", "perl"]

print "Ranking of some computer languages based on ease of use"
i = 1
for language in languageList:
 print i, language
 i = i + 1


result

Ranking of some computer languages based on ease of use
1 python
2 java
3 ...
4 perl


----------
From: Ian Tickle

I'm inclined to go along with you there: the main reason I've put off
learning Python is that it uses spacing characters (space, tab and
newline) as an integral part of the syntax and I've always felt
uncomfortable with that (memories of fixed format code and data I
suppose!).  Somehow a semicolon at the end of the line has a
reassuring air of finality!  Maybe a Python expert will answer this
but I've often wondered, what happens if as some editors do
(particularly if as I do you have to use different editors at
different times depending on where you are working, such as on Windows
working remotely from home or Linux at work), you could have a mixture
of space and tab characters in the file?  Often you don't know what
space/tab characters the editor is putting in and you shouldn't have
to know: the principle should be that if it looks right in the editor
then it is right.  So does Python automatically expand the tabs to the
equivalent number of spaces or (as in data input) are they treated as
single characters?  And anyway how does Python know what tab stops my
editors are set to and indeed how exactly my editors treat tabs?  The
appearance of the file doesn't tell you what's going on, e.g. N spaces
followed by a tab looks exactly the same as just a tab if N <8.  I
would hope that the answer is that I don't need to worry about any of
this!

Cheers

-- Ian

----------
From: Nat Echols

Yes, this can happen.  In practice, one learns very quickly to
configure the text editors to prevent this kind of mix-up, i.e. by
always inserting spaces instead of tab characters.  In vim, for
instance, you can do this:

set expandtab
set tabstop=2

For CCTBX, the rule is to use two spaces as the indentation (tabs
strictly forbidden), and despite having multiple active contributors,
this is almost never a problem.  (The rule for most other Python
modules appears to be four spaces, which I personally find too wide.)
It becomes second nature after a while, just like adding a semicolon
at the end of each statement in C/C++/Java/etc.  I agree that it seems
annoying and confusing at first, but if you've ever tried to edit
someone else's C or Perl code where the indentation was totally
inconsistent, you'll quickly learn to appreciate Python's style.
Answers are "no" and "it doesn't", at least I don't think so.  The
safest thing (especially if you need to copy and paste code from any
other module) is to never use literal tabs.

-Nat

----------
From: James Stroud


Just have a policy of no tabs. I make sure all of my editors interpret a tab as a set number of spaces (I like 2).

A tab was meant as a separator and not a formatting character anyway. Have you ever edited a document where someone used 8 tabs to center a line of text rather than simply applying "centered" formatting? If you have, you'll realize that tabs are not meant to be formatting characters.

When I first picked up a python book and it told me that indent had semantic relevance, I immediately thought of fortran, so almost gave it up. Two days later, python's use of whitespace seemed more natural than the alternative. Let's look at the Java hello world program again
Notice that this obviously veteran programmer has used whitespace formatting in a natural way to improve the readability of the code. Java did not enforce this. If a language harnesses this formatting for semantics, then all of the brackets and the semicolon become redundant.

You probably already format your code in a way consistent with python's whitespace rules if you have been programming for any length of time.

James


----------
From: James Holton

When did writing scripts become "off topic" for the CCP4BB!?


Personally, I use awk for most text-processing tasks, and since I have found that ~95% of science is converting information from one file format into another, I tend to use awk a lot.

The "k" in awk stands for Kernighan, one of the authors of "C", so the syntax is very similar.  Strange to me how one of the first things you hear about any "great new language" in the last 20 years has been "it has a C-like syntax".  Anyway, the main differences between awk and C is that the reading in and parsing of a text file line-by-line is implicit, as are variable types.  That is, strings and numbers inter-convert automatically, depending on how you use them.  A very short intro to awk (and the only one I have ever read) is here:
http://bl831.als.lbl.gov/~jamesh/pickup/awk.pdf
plus my own personal "tips & tricks" for using awk in crystallography:
http://bl831.als.lbl.gov/~jamesh/elves/manual/tricks.html#awk
and here is an awk program I wrote for taking apart a PDB file and then putting it back together:
http://bl831.als.lbl.gov/~jamesh/pickup/reformatpdb.awk
the last may be a bit "advanced" at first, but if you find yourself regularly doing weird things to PDBs that are not standard features of programs you already have like PDBSET or phenix.pdbtools, then it may be a good place to start.

That said, I should warn you that the "what is the best language" question is a sure-fire way to start a flame war.  Pretty much everyone you talk to who has done even a little programming will have a "favorite language" that they will defend as emphatically and energetically as they disparage every other language.  The reason for this is a sad truth: learning a language is work, and people tend to protect their investments.  This is just as true for computer languages as it is for "regular" languages, and although we have things like "Rosetta Stone" and "Hello World", nothing is ever going to change the fact that communicating complex ideas (be it to a machine or to a human) requires learning a complex language.  So, when you find that someone is unusually "pushy" about their favorite language: be afraid.  The language may be dying, and they must have invested a tremendous amount of time and effort learning it if they are so desperate for you to join their ranks.

Oh, and don't fall for the "so other people can read your code" trick.  Trust me, NOBODY wants to read your code!  Unless, of course, they are trying to re-write it in their favorite language.

-James Holton
MAD Scientist

----------
From: Pete Meyer


I don't think this is necessarily the case.  If I'm using your code to do something scientifically interesting, then I most definitely DO want to read it.  If your code crashes on my data, then I'm going to read it to figure out why (and probably fire up the debugger if necessary).

And even if I don't read your code, the fact that I have the ability to makes me more likely to try it out (in my experience, there's a general trend relating software I can't read with software that has irritating bugs).

That said, I'm probably in the minority of "other people".  And I'm probably not going to read the boring parts of a program unless I suspect there's a problem with them.


For my two cents on the original question - if you're using a library, use the language that library's written in.  Otherwise, use whatever language you feel most comfortable expressing the solution to your problem in.

Pete

----------
From: Mark Brooks


Hi Yuri,
          If you don't like Python, like myself (and I'm not alone, it would seem), you could try Ruby (http://www.ruby-lang.org/en/). Some examples of PDB file manipulation are below (taken from [1]).
 
The language is a great improvement in Perl and Python in my opinion, but the downside is that there aren't as many crystallographic features as CCTBX et al, I don't think.
 
Yours,
 
Mark
 

Structure I/O and Manipulation

How do I read a structure file?

The current implementation only reads PDB format files, there will hopefully be code for parsing mmCIF and PDB XML files in the future. Adding these new formats will probably change the API for reading and writing structure files.

Given the above caveat, the current way to read a PDB file is to slurp the file into a string which is fed to the constructor of the Bio::PDB class:

require 'bio/db/pdb' string = IO.read('pdb1tim.ent') structure = Bio::PDB.new(string)   The new Bio::PDB object now contains all the annotations and coordinate data.

How do I write a structure file?

As with reading, the current implementation can only write in PDB format.

All the coordinate classes in the Bio::PDB heirarchy have .to_s methods which return the object in PDB format. This makes output as simple as:

require 'bio/db/pdb' file = File.new('pdb1tim.ent').gets(nil) structure = Bio::PDB.new(file) puts structure.to_s # Writes whole structure puts structure[nil]['A'].to_s # Writes chain A only

How do I access annotations?

The annotations from the PDB file are stored in Bio::PDB::Record objects. You can retrieve a specific annotation by calling the '.record' method of a Bio::PDB object with the name of the annotation as the argument:

structure.record("HEADER")

In fact '.record' returns an array of all Bio::PDB::Records of the given type, so you'll need to call '.first' to get to the actual Bio::PDB::Record. Bio::PDB::Record provides methods to get to the individual data in each record. E.g. The 'HEADER' record provides classification, depDate and idCode methods. You can look at the PDB format to see what fields are available in each record, or just look in the pdb.rb file which has a hash of all the definitions.

So, to get the title of a PDB file you would do it like this:

structure.record('TITLE').first.title

Some records have their own special methods:

structure.remark(num)  #Returns hash of the REMARK records based on 'num' structure.jrnl         #Returns a hash of the JRNL records structure.jrnl('TITL') #Returns an array of all the TITL subfields from                        #the JRNL records structure.helix(id)    #Returns the HELIX records based on 'id'                        #Returns an array if no 'id' is given structure.turn         #Same interface as '.helix' structure.sheet        #Same interface as '.sheet' structure.seqres(id)   #Returns the sequence given in the SEQRES record                        #as a Bio::Sequence::AA object structure.keywords     #returns a flattened lsit of keywords

And some methods are included for Bio::DB compatability:

structure.entry_id     #Returns idCode structure.accession    #Same as entry_id structure.definition   #Title structure.version      #REVDAT

How do I access the coordinate data?

Coordinate data is stroed in a heirarchy of classes with Bio::PDB at the top. A Bio::PDB object contains one or more Bio::PDB::Model objects which contain one or more Bio::PDB::Chain objects which contain one or more Bio::PDB::Residue objects which contain Bio::PDB::Atom objects.

There are two ways to access the coordinate data in a Bio::PDB object: keyed access and iterators.

Keyed access provides the simplest way to get to data if you know which residues or atoms you're interested in. For instance if you wanted to get hold of chain 'A' you can do it like this:

 chain = structure[nil]['A']

The nil refers to the model, which will be nil in crystal structures but may have a number in NMR structures. Every level in the heirarchy can be accessed this way. E.g. to get hold of the CA atom of residue 100 in chain 'A':

structure[nil]['A']['100']['CA']

or if you still have your chain object:

chain['100']['CA']

The residue identifier is not just the residue number. PDB residues can have insertion codes and even negative numbers.

chain['100A']['CA']

could be valid.

Iterators are also provided so you can easily go through all objects in the heirarchy. E.g:

structure.each{ |model|   model.each{ |chain|     chain.each{ |residue|       residue.each{ |atom|         puts atom       }     }   } }

Goes through every atom in the structure and outputs it. All the classes in the heirarchy implement the standard Enumerable and Comparable mixins as well.

Each class has a number of accessors to allow access to the data, most is in the Bio::PDB::Atom class:

  • Bio::PDB::Model has a 'model_serial' accessor only.
  • Bio::PDB::Chain has an 'id' accessor for getting and setting the chain id.
  • Bio::PDB::Residue has accessors for resName, resSeq and iCode.
  • Bio::PDB::Atom has accessors for serial, element, alt_loc, x, y, z, occ, and bfac

To find the x coordinate of the CA atom of residue 100 is then:

structure[nil]['A']['100']['CA'].x

How do I make a deep copy of a structure object?

You can't! There seem to be problems implementing a deep copy function in BioRuby. This is to be fixed in future versions.

Note to developers: have we tried the time-old hack of Marshal.load(Marshal.dump(obj)) to do the trick?

How do I measure distances and angles?

Methods for measuring distance and dihedral angles are provided in the Utils module. To measure distance, use the Bio::PDB::Utils.distance method:

res1 = structure[nil]['A']['100'] res2 = structure[nil]['A']['101']   distance = Bio::PDB::Utils.distance(res1['CA'].xyz,                                     res2['CA'].xyz)

Bio::PDB::Utils.distance requires two Vectors as its arguments. Bio::PDB::Coordinate objects, which are produced by the .xyz call to the Bio::PDB::Atom object, are Vectors. You can also provide a Vector directly:

distance = Bio::PDB::Utils.distance(res1['CA'].xyz,                                     [10,10,10])

And use the .centreOfGravity and .geometricCentre methods, both of which provide a Bio::PDB::Coordinate:

distance = Bio::PDB::Utils.distance(res1.centreOfGravity,                                     res1.geometricCentre)

All objects in the Bio::PDB heirarchy implement the centreOfGravity and geometricCentre methods.

Dihedral angles are calculated from four Vectors / Bio::PDB::Coordinates:

phi = Bio::PDB::Utils.dihedral_angle(res1['C'].xyz,                                      res2['N'].xyz,                                      res2['CA'].xyz,                                      res2['C'].xyz) psi = Bio::PDB::Utils.dihedral_angle(res2['N'].xyz,                                      res2['CA'].xyz,                                      res2['C'].xyz,                                      res3['N'].xyz)

How do I find specific atoms, residues, chains or models?

If the iterators and keyed access aren't powerful enough for you, the finder method is also provided. All the objects in the Bio::PDB hierarchy implement finder. It requires the class of object you wish to find as the first argument, and then a block which is evaluated for each object. If the block returns true then that object is added to an array of 'found' things.

For example, to find all the atoms within a box:

atoms_in_box = structure.finder(Bio::PDB::Atom){ |atom|   atom.x.between?(10,20) &&    atom.y.between?(20,30) &&    atom.z.between?(30,40) }

Or, to find all the HIS residues, in chain 'A' can be written like this:

his_residues = structure[nil]['A'].finder(Bio::PDB::Residue){ |res|   res.resName == 'HIS' }

or like this:

his_residues = structure.finder(Bio::PDB::Residue){ |res|   res.resName == 'HIS' && res.chain.id == 'A' }

Since the results are returned in arrays, it is very simple to do set operations, such as finding the overlap or difference between two selections:

sel1 = structure.finder(Bio::PDB::Atom){ |atom|    atom.residue.resName == 'HIS' } sel2 =  structure.finder(Bio::PDB::Atom){ |atom|   atom.x.between?(10,20) &&    atom.y.between?(20,30) &&    atom.z.between?(30,40) } sel3 = sel1 & sel2 sel1 = sel1 - sel3

Selection 3 now contains all the HIS atoms within the box and selection 1 contains all the HIS atoms outside the box.

How do I work with ligand data?

Because some PDB files reuse residue numbers that they already used in the ATOM records in the HETATM records, we have to add an extra flag when we want to access HETATM data. The extra flag is simply adding 'LIGAND' to the residue id. E.g Given the PDB file

ATOM      1  N   GLY A   2      -5.308  22.647  33.657  1.00 59.93   ATOM      2  CA  GLY A   2      -5.090  23.965  33.005  1.00 53.36 ATOM      3  C   GLY A   2      -6.209  24.197  32.021  1.00 52.56 ATOM      4  O   GLY A   2      -7.000  25.134  32.176  1.00 55.02 .... HETATM 2097  O8  PHT A   2     -18.122  40.603  18.146  1.00 16.14 HETATM 2098  O9  PHT A   2     -17.348  39.940  20.109  1.00 16.06 HETATM 2099  C10 PHT A   2     -15.622  41.753  16.662  1.00 13.34 HETATM 2100  O11 PHT A   2     -16.077  42.457  15.721  1.00 16.27 
Asking for
structure[nil]['A']['2']
is ambiguous, so if you want the ligand data you must use
structure[nil]['A']['LIGAND2']

There should also be an .each_ligand method in the Bio::PDB::Chain class for iterating through the ligands, but this is not implemented in the current cvs version.

How do I work with solvent?

Solvent is defined in the current parser as any HETATM record with HOH as the residue type. This may not be ideal for some files, but deals with 95% of cases which is good enough for me at the moment!

Solvent residues are stored in a separate Bio::PDB::Chain attached to each Bio::PDB::Model. Currently the only methods available are 'addSolvent' which adds a residue to the solvent chain and 'removeSolvent', which sets the whole solvent chain to nil.

An each_solvent method for Bio::PDB::Model has been implemented in my copy but is not in cvs at the moment.

How do I get the sequence of a chain?

There are two types of sequence in a PDB file: the real sequence of whatever protein has been crystalised, and the sequence observable in the structure.

The first type of sequence is available through the SEQRES record (if this is provided in the PDB file). You can access this through the top-level Bio::PDB object:

structure.seqres['A']

provides the sequence given in the SEQRES record for chain A. The observed sequence is obtained from the Bio::PDB::Chain object. E.g.

structure[nil]['A'].atom_seq

Both methods return a Bio::Sequence::AA object.




----------
From: Miguel Ortiz Lombardía

Hi Yuri,

To add something nobody else has mentioned yet.
If you are interested in statistics and analysis you may find
interesting the R package. There is an excelent module, called Bio3D for
the analysis of protein structure and sequence data. Have a look here:

http://bio3d.pbworks.com/w/page/7824467/FrontPage

I use R for a number of things. The learning curve is not that steep.
But may not be what you're looking for.

Best regards,


--
Miguel


No comments:

Post a Comment