Rockefeller

Projects

Laboratory of Theoretical Condensed Matter Physics

Siggia Lab

Evolution

Fly Patterning

There are now upwards of 12 fully sequenced species of flies, covering a range of divergence times from D.melanogaster of 2 to over 50 million years. To explore the molecular evolution of regulatory sequence, we have found that D.yakuba is an optimal comparison for D.melanogaster, with D.pseudoobscura as an outgroup to distinguish insertions from deletions. Tandem duplications are an important source of indels, which account for more base pairs of change than do point mutations.
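The role of the outgroup can be illustrated with a minimal sketch. It assumes each aligned segment is scored only as present or absent in the three species and that the outgroup state is ancestral (simple parsimony). The function name and return strings are hypothetical, chosen for illustration, and this is not the lab's pipeline.

```python
def classify_indel(in_mel: bool, in_yak: bool, in_pse: bool) -> str:
    """Polarize a presence/absence difference between D.melanogaster and
    D.yakuba, using D.pseudoobscura as the outgroup (simple parsimony)."""
    if in_mel == in_yak:
        return "no difference between mel and yak"
    # Assumption: the outgroup carries the ancestral state.
    if in_pse:
        # Segment is ancestral, so the lineage lacking it has lost it.
        return "deletion in " + ("yak" if in_mel else "mel")
    # Segment is absent from the outgroup, so the lineage carrying it gained it.
    return "insertion in " + ("mel" if in_mel else "yak")
```

Without the third species, a segment found in only one of the two close relatives could equally be a gain in one lineage or a loss in the other.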

In another project with Ulrike Gaul, we have examined ~100 known and predicted anterior-posterior blastoderm patterning modules in D.melanogaster for those that show interesting variation with D.pseudoobscura. We find that modules with duplicate expression domains are liable to change, instances where homologous regulatory regions (e.g. eve stripe 3-7) give different patterns, and cases of large (hundreds of bp) deletions in known modules.

It is only within the academy that gene regulatory analysis and protein structure modeling are distinct enterprises; within the cell they are not. Do the regulatory proteins change their specificities between D.melanogaster and D.pseudoobscura, and can one predict the binding preferences of the mosquito homologues of fly proteins?

Antibiotic resistance

The Kansas farmers who do not believe in evolution have only to consult their own records of pesticide usage to conclude that insects change to become resistant to the chemicals that kill them, and that these changes are heritable. Evolution happens even more rapidly in hospitals, in bacteria such as Staphylococcus aureus, which is a benign inhabitant of the skin and mucosal surfaces but deadly when it infects wounds. Many antibiotics target a specific gene (e.g. rifampicin targets a component of RNA polymerase, .. DNA gyrase, etc.). The bacteria then develop resistance by point mutations in the affected genes. Alternatively, a substitute gene is acquired via a mobile element, as happened with mecA in S.aureus. Resistance to beta-lactams incurs a growth penalty, so hospital isolates are 99.9..% susceptible, but mutate at a low, strain-dependent rate to the resistant phenotype. Resistance also requires tens of auxiliary genes. These bacteria (so-called MRSA) have also infected the popular press. The next line of defense against MRSA is another cell wall inhibitor, vancomycin, and resistance to it develops via a series of unknown genetic changes.

With the Tomasz lab at Rockefeller we are utilizing a number of approaches to sequence the entire genomes of closely matched pairs of susceptible and resistant bacteria. The comparison of mRNA expression data has not pointed to the relevant genetic changes, and we believe genes with pleiotropic effects are involved. The appeal of antibiotic resistance for evolutionary studies is the availability of money, a historical record, and a semi-natural context.


Gene Regulation

Mobydick

Motifs in biological sequence data can be defined as strings whose probability of occurrence greatly exceeds that expected for background. The problem is to decide what constitutes background, and what the natural limits of a motif are, since large enough pieces of a motif will themselves show up in a list of improbable strings. An algorithm to resolve both issues has been constructed by analogy with the statistical mechanics of disordered systems and has been usefully applied to decode all the regulatory sequence in yeast. Some of the output is given here.
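The definition of a motif as an improbable string can be made concrete with a naive sketch. It is not the Mobydick algorithm: it assumes an independent-letter background estimated from the sequence itself, a fixed word length k, and a binomial approximation that ignores overlaps between words. The function name and the `min_z` threshold are hypothetical.

```python
from collections import Counter
from math import sqrt

def overrepresented_kmers(seq, k, min_z=3.0):
    """Return (kmer, count, z) for k-mers seen more often than expected
    under an independent-letter background estimated from seq itself."""
    n = len(seq)
    windows = n - k + 1
    if windows <= 0:
        return []
    letter_freq = {c: m / n for c, m in Counter(seq).items()}
    counts = Counter(seq[i:i + k] for i in range(windows))
    hits = []
    for kmer, observed in counts.items():
        p = 1.0
        for c in kmer:
            p *= letter_freq[c]
        expected = windows * p
        # Binomial standard deviation; correlations between overlapping
        # windows are ignored in this sketch.
        sd = sqrt(windows * p * (1.0 - p))
        z = (observed - expected) / sd if sd > 0 else 0.0
        if z >= min_z:
            hits.append((kmer, observed, z))
    # Most surprising words first.
    return sorted(hits, key=lambda h: -h[2])
```

Run with a shorter k on a sequence containing a longer repeated word, this naive count also ranks fragments of that word highly, which is exactly the ambiguity about a motif's natural limits described above.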

The algorithm was tested on the eponymous novel by Melville. Random letters were inserted between the words and the result reduced to a string of lower case letters. The code was then asked to recover the English dictionary (or the subset used by Melville, which was substantial). A sampling of the dictionaries that were created as longer and longer strings were searched is shown as plain text files.
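The preparation of the test text can be sketched as follows. The source does not say how many random letters were inserted, so the `max_insert` parameter, the `seed`, and the function name are assumptions made for illustration.

```python
import random
import re
import string

def scramble_text(text, max_insert=3, seed=0):
    """Lowercase the text, drop everything but letters, and join the words
    with short runs of random letters in place of the spaces."""
    rng = random.Random(seed)
    words = re.findall(r"[a-z]+", text.lower())
    pieces = []
    for i, word in enumerate(words):
        if i > 0:
            # Hypothetical choice: 0 to max_insert random letters per gap.
            filler_len = rng.randint(0, max_insert)
            pieces.append("".join(rng.choice(string.ascii_lowercase)
                                  for _ in range(filler_len)))
        pieces.append(word)
    return "".join(pieces)
```

The output is a single run of lower case letters with no word boundaries, so any dictionary recovered from it has to come from the statistics of the string alone.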

Enteric Bacteria (E.coli and relatives)

There are intrinsic limits to what can be inferred from a single genome by probabilistic methods. The cell classifies sequence motifs with proteins whose DNA binding specificity we cannot calculate. Given only sequence, we have to cluster similar patterns together, which for sparse data is much harder. To circumvent this limitation, we do what the cell cannot do, namely compare the regulation of homologous genes from related organisms. Mathematically this provides more samples from the same distribution and thus makes clusters visible. Here is a compilation of Inferred E.coli Regulons.

There are approximately 10 sequenced species of enteric bacteria that are close enough to E.coli to share regulatory motifs. We have designed algorithms to measure how fast minimally constrained regulatory sequence evolves and then, with respect to this rate, to quantify the significance of motifs that evolved less rapidly. The transcription factors themselves evolve at a rate determined by the number of genes they regulate. The results from our Genome Research paper are displayed here: E.coli Regulatory Comparisons.
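One simple way to phrase "evolved less rapidly than background" is a binomial tail probability. The sketch below assumes that sites differ independently between two aligned species, with a per-site mismatch probability `p_neutral` estimated from minimally constrained sequence. It is an illustration of the idea, not the method of the Genome Research paper, and the function name is hypothetical.

```python
from math import comb

def conservation_pvalue(width, mismatches, p_neutral):
    """Probability of seeing at most `mismatches` differences in an aligned
    window of `width` sites if each site differs independently with
    probability `p_neutral`."""
    return sum(
        comb(width, k) * p_neutral ** k * (1.0 - p_neutral) ** (width - k)
        for k in range(mismatches + 1)
    )
```

For example, a perfectly conserved 12-site window at `p_neutral = 0.3` has probability 0.7**12, about 0.014, under this null model. A real analysis would also need to correct for the number of windows tested.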

Gram Positive Bacteria

B.subtilis is the second most intensively studied bacterium, and it was of interest to apply the algorithms we developed for E.coli to it. Because of its proximity to B.anthracis, there is now a cluster of related genomes on which to explore comparative algorithms. More distant species such as the Streptococcaceae and Staphylococcus aureus have become antibiotic resistant and are thus a serious medical problem, but provide interesting data for evolutionary studies.

Patterning Fly Embryos

There has been a very productive convergence between evolutionary biology and development around the idea that most evolutionary novelty is due to changes in the regulation of existing genes rather than the production of new genes. Our understanding of regulatory evolution will progress in tandem with better algorithms to recognize and parse regulatory sequence. In collaboration with Ulrike Gaul's lab, we are testing algorithms that enable us to identify cis-regulatory modules (~500 bp regions with binding sites for multiple factors) in the fly genome using collections of known binding sites. Alternatively, binding motifs can be found from intervals of sequence that are known to be functional. One key test is the segmentation gene hierarchy, a prototype of combinatorial control, where we have been quite successful in finding new blastoderm patterned genes and new binding motifs: Ahab.
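The idea of a module as a ~500 bp region dense in binding sites can be sketched with a sliding-window count. This is a deliberately crude stand-in and not Ahab: it assumes binding sites are given as exact strings rather than weight matrices, and the function names, `step`, and `min_sites` threshold are hypothetical.

```python
def count_sites(chunk, motifs):
    """Count exact, possibly overlapping matches of each motif in chunk."""
    total = 0
    for m in motifs:
        total += sum(1 for i in range(len(chunk) - len(m) + 1)
                     if chunk.startswith(m, i))
    return total

def scan_for_modules(seq, motifs, window=500, step=50, min_sites=3):
    """Slide a window along seq and report (start, n_sites) wherever the
    window holds at least min_sites motif matches."""
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        n = count_sites(seq[start:start + window], motifs)
        if n >= min_sites:
            hits.append((start, n))
    return hits
```

Overlapping windows that pass the threshold would normally be merged into one candidate module; that step is omitted here for brevity.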

More recent work uses both sequenced Drosophila genomes in the search and as a byproduct can screen for homologous regulatory modules that have changed between the two species. A more challenging task will be to dissect the regulatory cascade that gives rise to glial cells, a case where there is a known master regulator (Gcm) but with very few direct targets.

Budding Yeast

With Frederick Cross, we are using our analysis of gene expression in yeast to design experiments to probe the "grammar" of regulatory elements. It is seldom stated, but true in all cases we have examined, that most (75 percent) of the sites of the best-characterized factors in yeast do not imply expression for their cognate genes. Analysis of chip and more recently comparative sequence data is still far from providing a clear list of regulatory elements as detailed in a recent review.

An orthogonal approach to examining natural sequence is to construct random libraries from a restricted class of sequences and then assay them for function. The yeast "oracle" then pronounces which sentences are meaningful. One strategy for doing this, using a single sporulation-specific activator and random linkers, was recently published. Firm rules still did not emerge after sequencing several hundred random constructs equally divided between functional and dead sequences.

Another project examines the regulation of 20 genes by the well-characterized cell cycle factors MBF and SBF using Northerns, immunoprecipitations, and deletion of the factors. There is only a minimal correlation between the presence of the factor on the promoter and the response to its deletion.

Protein structure modeling

The dream of computational protein structure prediction has created a field that, if its promise is realized, could revolutionize the prediction of gene regulation. Models of protein-DNA interactions fall roughly into two categories: empirical (relying on many examples of one family to fit the matrix of base-residue interactions, e.g. the Zn fingers), and biophysical (still data intensive, since some comparison 3D structure is required). With a postdoc schooled in the trade, we have looked at the reliability of potential-based predictions of DNA binding specificity. Based on multiple factors, the most favorable ratio of output per effort was achieved by a careful structure-based definition of the protein-DNA interface and a simple counting of contacts to define specificity. Potentials were very useful in cases where the specificity derived in part from the variable ability of DNA to bend to fit the shape defined by the protein. An extreme form of 'induced specificity' is nucleosome positioning. Merely knowing which interface residues are changed relative to a reference structure (and which bases are contacted) furnishes a very informative prior on motif searches.

Biophysics

Cell Cycle and variability

On the list of basic biological processes, the cell cycle must rank in importance just below basic metabolism. As a 'network', to use an atavistic term, the cell cycle presents a realistic mixture of transcriptional and post-translational regulation (mostly the latter) that does not fall within any existing bioinformatic category. A neglected aspect of cell cycle research over the past two decades is its variation at the single cell level, which makes contact with earlier work on stochasticity in gene regulation. With Fred Cross, we are making movies as a single yeast cell grows to a colony of ~40 cells, using both phase and fluorescence microscopy. There are many markers available that record defining events in the cell division process. Yeast is a versatile system for 'noise' studies since it can be grown with variable ploidy. Part of this project involves custom image analysis and annotation software. We are also contemplating the feasibility of imposing time-dependent perturbations on the cell cycle.

Cellular biophysics

Although the basic physics applicable to the cellular domain was understood early in this century, its utility in addressing "messy" problems has advanced considerably in the last few decades. Physics applied to cell biology is less reductionist than biochemistry. The challenge for the theorist is to deduce novel and quantitative conclusions from less than full chemical detail. The opportunities for doing so are enhanced when physics contributes to the experimental design rather than being added at the end to fit curves.

One very productive collaboration along these lines is with the laboratory of Jennifer Lippincott-Schwartz (U.S. National Institutes of Health), which uses green fluorescent chimeric proteins to follow various steps in protein trafficking and the maintenance of organelles during the cell cycle. Given a particular cell, we can simulate diffusion of any marker, such as a membrane protein in the endoplasmic reticulum (ER), and compare with photobleaching experiments on live cultured cells. The quantitative agreement between theory and experiment has been used to argue, for instance, that both an inner nuclear envelope marker and a Golgi marker reside in a continuous membrane system throughout mitosis. By contrast, the time course seen during the Brefeldin A-induced dissolution of the Golgi is not diffusive, and we speculate that it may involve a tension-driven flow, such as occurs during wetting or the spreading of surfactant on an interface. Our code for simulating diffusion in an inhomogeneous two-dimensional system has been distributed to other labs. Ref 98.
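A minimal sketch of diffusion in an inhomogeneous two-dimensional system is given below. It is not the lab's distributed code. It assumes a square grid, an explicit finite-difference step for dc/dt = div(D grad c), no-flux outer boundaries, and a harmonic mean of the two cell diffusivities on each shared edge, so that a cell with D = 0 acts as an impermeable barrier. All function and parameter names are hypothetical.

```python
def diffuse_step(c, d, dt=0.1, dx=1.0):
    """One explicit step of dc/dt = div(D grad c) on a 2D grid.
    c and d are lists of rows (concentration and diffusivity).
    Stable when dt * max(D) / dx**2 <= 0.25."""
    ny, nx = len(c), len(c[0])
    new = [row[:] for row in c]
    for y in range(ny):
        for x in range(nx):
            # Visit each shared edge once: the neighbor below and to the right.
            for yy, xx in ((y + 1, x), (y, x + 1)):
                if yy < ny and xx < nx:
                    d1, d2 = d[y][x], d[yy][xx]
                    # Harmonic mean: zero if either cell is impermeable.
                    d_edge = 2.0 * d1 * d2 / (d1 + d2) if d1 + d2 > 0 else 0.0
                    flux = d_edge * (c[yy][xx] - c[y][x]) * dt / dx ** 2
                    # What one cell gains the other loses, so mass is conserved.
                    new[y][x] += flux
                    new[yy][xx] -= flux
    return new

def bleach(c, y0, y1, x0, x1):
    """Zero the concentration in a rectangle, mimicking a photobleach."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            c[y][x] = 0.0
```

Calling `bleach` on a uniform grid and then iterating `diffuse_step` gives a recovery curve for the bleached region that can be compared with a measured one; the geometry of the organelle enters only through the diffusivity map `d`.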

Polymers/nucleic acid

Phenomenological theories of polymers have proved very successful in explaining the mechanical properties of DNA, the shapes of supercoiled plasmids, and the kinetics of reactions on these substrates. The morphology and mechanics of the brush-like chromosomes characteristic of meiotic prophase are also amenable to treatment. Another problem in polymer physics is the kinetics of RNA folding at the level of secondary structure. We derived expressions for the energies of pseudoknots (which cannot be treated by existing codes) in terms of known parameters, allowed overlapping stems, and optimized them so as to calculate plausible saddle points between various topologies. Structures as large as the 400bp group I introns can be folded with plausible kinetics.