Rockefeller

Software

Laboratory of Theoretical Condensed Matter Physics

Siggia Lab



Photobleach Recovery

Purpose:
This program was written to simulate diffusion in a two-dimensional, isotropic but inhomogeneous medium. It has been used productively to follow photobleach recovery in whole cells for both soluble and membrane-confined fluorescent markers. Soluble markers can be inhomogeneous because they are excluded from the nucleus, or because the cell has variable thickness and is being viewed in projection. For endoplasmic reticulum markers the inhomogeneity is obvious in the image. The code assumes a continuous medium, so it is only appropriate if the bleached region is much larger than the spacing of the ER tubules.
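As an illustration of the kind of calculation involved, here is a minimal Python sketch (not the lab's C code). It assumes the observed fluorescence is f = h * c, where h is the local thickness or accessible fraction and c is the concentration, and that c obeys d f/dt = D * div(h * grad(c)) with no-flux boundaries. The function names, the explicit finite-difference scheme, and the parameters are all assumptions made for this example.

```python
import numpy as np

def diffuse_step(f, h, D, dt, dx=1.0):
    """Advance the projected fluorescence f by one explicit time step.

    Assumed model: f = h * c and d f/dt = D * div(h * grad(c)), no-flux edges.
    h (thickness or accessible fraction) must be positive everywhere.
    """
    c = f / h                                  # concentration per accessible volume
    new = f.astype(float)                      # astype returns a fresh copy
    for axis in (0, 1):
        c_a = np.swapaxes(c, 0, axis)
        h_a = np.swapaxes(h, 0, axis)
        new_a = np.swapaxes(new, 0, axis)      # view into `new`
        h_face = 0.5 * (h_a[1:] + h_a[:-1])    # thickness at the shared cell face
        moved = D * dt / dx**2 * h_face * (c_a[1:] - c_a[:-1])
        new_a[:-1] += moved                    # gain from the higher-index neighbour
        new_a[1:] -= moved                     # matching loss, so total f is conserved
    return new

def simulate_recovery(f0, h, D, dt, steps):
    """Repeatedly apply diffuse_step, starting from the post-bleach image f0."""
    f = f0
    for _ in range(steps):
        f = diffuse_step(f, h, D, dt)
    return f
```

In this sketch an unbleached image proportional to h is a steady state, and the explicit scheme is only stable for small time steps (roughly D * dt / dx**2 well below 1/4, smaller still where h varies sharply).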
Coding:
Done in C, rather unprofessionally, with a klutzy interface to read the TIFF images from the microscope.
References:
method
Availability:
Rockefeller provides a free academic license (a material transfer agreement, in legal parlance); email EDS to request one. Source code is provided.

Dictionaries for Genomes

Purpose:
The program 'MobyDick' was developed to find overrepresented words (strings with a few degenerate base symbols) in samples from tens of kb to several Mb in length. It is a maximum likelihood method, and assumes that the data is generated by picking words at random from a dictionary, with the word frequencies as the fit parameters. Words are added with increasing length when the current model underpredicts their frequency in the data. Shorter words may disappear as a result. Empirically, the final solution is unique or nearly so. Convergence is improved when long exact repeats are removed with the REPuter program. A nice illustration of how words are built up from syllables is provided by the recovery of an English dictionary from the novel Moby Dick reduced to a long string of [a-z].
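The generative model described above can be sketched in a few lines of Python. Under the assumption that a sequence is an unmarked concatenation of words drawn independently from the dictionary, its likelihood is the sum over all ways of segmenting it into words, which a forward recursion computes in one pass. The function name and the toy dictionary are hypothetical, and this is not the MobyDick code itself, which also fits the frequencies and grows the dictionary.

```python
def dictionary_likelihood(seq, dictionary):
    """Probability of seq summed over all segmentations into dictionary words.

    dictionary maps word -> probability (the fit parameters of the model).
    z[i] is the total probability of every way of writing seq[:i] as words.
    """
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0                                  # the empty prefix has one parse
    for i in range(1, len(seq) + 1):
        for word, p in dictionary.items():
            n = len(word)
            if n <= i and seq[i - n:i] == word:
                z[i] += p * z[i - n]            # extend each parse of seq[:i-n] by `word`
    return z[-1]

# Toy dictionary: the longer word "ab" competes with the parse "a" + "b".
toy = {"a": 0.5, "b": 0.3, "ab": 0.2}
likelihood = dictionary_likelihood("ab", toy)   # 0.5 * 0.3 + 0.2
```

For real sequence lengths the product of probabilities underflows, so a practical version would rescale z or work with logarithms.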

The authors have experimented with converging entire weight matrices with this algorithm, but then multiple minima are a fatal problem. The algorithm is completely hopeless for modeling sequence in which the information is the absence of a few words in otherwise random sequence. A plausible extension would be to segment

Coding:
Intelligently done in generic C with various Perl scripts to tie things together. The script for generating RegEx words is greedy and a bit klutzy.
References:
method, application.
Availability:
Email EDS to get a tarball with source, or try Hao Li's web site here.

Ahab and Stubb

Purpose:
These are two descendants of MobyDick (we skipped Starbuck on the crew list so as not to confuse the coffee drinkers). Both programs fit a set of predefined weight matrices to data. We no longer use Ahab because of some coding infelicities and the expanded functionality in Stubb, which handles multiple species with an evolution model that we have also used in our PhyloGibbs code, and which also fits and scores positional correlations between motifs.
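One way to picture fitting predefined weight matrices to data is the sketch below, written in Python for illustration only. It assumes a sequence is generated as a series of independent choices between a single background base and a complete binding site drawn from one of the weight matrices, and it computes the likelihood for fixed mixing probabilities. The function names, the data layout, and the zeroth-order background are assumptions of this example, not a description of the Ahab or Stubb source.

```python
def site_probability(site, wm):
    """Probability of a site under a weight matrix given as a list of column dicts."""
    p = 1.0
    for column, base in zip(wm, site):
        p *= column[base]
    return p

def module_likelihood(seq, motifs, background):
    """Likelihood of seq as a mix of background bases and weight-matrix sites.

    motifs: list of (p_m, wm) pairs, p_m being the probability of emitting
    that motif at each step; the remaining probability goes to background.
    """
    p_bg = 1.0 - sum(p_m for p_m, _ in motifs)
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0
    for i in range(1, len(seq) + 1):
        # Last element of the parse is a single background base...
        z[i] = p_bg * background[seq[i - 1]] * z[i - 1]
        # ...or a full site from one of the weight matrices.
        for p_m, wm in motifs:
            w = len(wm)
            if w <= i:
                z[i] += p_m * site_probability(seq[i - w:i], wm) * z[i - w]
    return z[-1]
```

Fitting would then mean adjusting the p_m values to maximize this likelihood over a window of sequence; that optimization step is not shown here.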
References:
Basic algorithm for Ahab and Stubb, and applications to fly, one or two species.
Availability:
Email EDS for a tarball with Ahab; for Stubb source code, contact S. Sinha. There is also a Stubb web site (which, however, does not implement the motif correlation feature). A useful graphical depiction of binding sites within a module, called 'windowfit', is also available at this web site. For genome-wide runs, you have to run the program locally.

Clustering motifs

Purpose:
Assume you have found a few thousand 'mini-weight matrices' by running some motif finder on orthologous sequences from a collection of genomes. How many different motifs are represented, and how does one find the composite weight matrices? Clustering is accomplished by sampling the distribution of all possible ways of generating the mini-weight matrices by sampling an unknown number of unknown weight matrices. There is no need for ad hoc similarity scores: we implement a Bayesian model for the hypothesis that the data derive from sampling independent weight matrices. In the authors' humble opinion there are no competitors for this task. A weakness is that motifs of substantially different widths are not treated.
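To make the Bayesian comparison concrete, the following Python sketch scores whether two mini-weight matrices of equal width are better explained by one underlying weight matrix or by two independent ones. It assumes each column is multinomial with the unknown column probabilities integrated out under a uniform Dirichlet prior; the prior, the function names, and the count layout are assumptions of this example, and the sampling over all possible partitions described above is not shown.

```python
import math

def log_marginal(counts, alpha=1.0):
    """Log probability of one alignment column with the weight-matrix column
    integrated out under a symmetric Dirichlet(alpha) prior."""
    n = sum(counts)
    k = len(counts)
    out = math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
    for c in counts:
        out += math.lgamma(c + alpha) - math.lgamma(alpha)
    return out

def log_merge_ratio(cols_a, cols_b):
    """log of P(both sets from one weight matrix) / P(two independent ones).

    cols_a, cols_b: per-column base counts [nA, nC, nG, nT], same width.
    Positive values favour putting the two sets in the same cluster.
    """
    ratio = 0.0
    for ca, cb in zip(cols_a, cols_b):
        merged = [x + y for x, y in zip(ca, cb)]
        ratio += log_marginal(merged) - log_marginal(ca) - log_marginal(cb)
    return ratio
```

Because the score comes directly from the two hypotheses, no separate similarity measure between matrices has to be chosen.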
References:
The original application to E. coli.
Availability:
Contact van Nimwegen for the PROSCE code. It was professionally coded in C and C++.

Motif finding in related species

Purpose:
We have implemented a Gibbs sampler that uses the Stubb model for motif evolution. It will correctly weight data when several species are close and others are more removed from the common ancestor. It will also realistically assess the significance of the motifs thus found. It has been described by some early users as a 'tank': it crushes most applications, but with some startup costs, and it is not fast (though a lot quicker than an experiment!). The same evolution model, but searching with maximum likelihood, is available in the PhyME code. The code allows the use of prior information about the motif, such as would be obtained from a structural model.
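The weighting of close versus distant species can be illustrated with a small Python sketch. It assumes a star phylogeny and a simple substitution rule, stated here as an assumption rather than taken from this page: each species keeps the ancestral base with probability q (its "proximity" to the ancestor) and otherwise draws a fresh base from the weight-matrix column. The function name and parameters are hypothetical.

```python
def column_likelihood(bases, proximities, w):
    """Likelihood of one aligned column of a binding site across species.

    bases: observed base in each species.
    proximities: q per species, the probability the ancestral base is kept.
    w: weight-matrix column, base -> probability.
    The unknown ancestral base is drawn from w and summed out.
    """
    total = 0.0
    for ancestor, w_anc in w.items():
        term = w_anc
        for base, q in zip(bases, proximities):
            kept = q if base == ancestor else 0.0
            term *= kept + (1.0 - q) * w[base]
        total += term
    return total

w_col = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
distant = column_likelihood(["A", "A"], [0.0, 0.0], w_col)  # two independent draws
close = column_likelihood(["A", "A"], [1.0, 1.0], w_col)    # effectively one draw
```

With q = 0 the two species count as two independent observations of the motif, while with q = 1 they count as a single observation, so closely related species are not double counted.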
References:
With application to the yeast ChIP-chip data .
Availability:
Contact the primary authors, Erik van Nimwegen or Rahul Siddharthan. There should be a web interface available.

Protein structure and DNA binding specificity

Purpose:
A web site to automate the task of finding the closest known co-crystal structure for an arbitrary transcription factor and determining its binding specificity according to our contact model. Coming soon (8/06).
Reference:
The original paper and others in preparation.