Oct 6 Notes

From Chemical Informatics and Cyberinfrastructure Collaboratory

CICC Meeting Notes Oct 6

  • I'll assume all slides are on the wiki.

  • Presentation by David Wild:

  • Main theme #1: who are the customers? HTS people are one group. Scripps is the main group here. GCF mentions Peter Cherbas (head of CGB) and his group. Mookie: the workflow that we already have would be directly applicable to work done by Samy Meroueh.
  • Samy: scoring function research is done through a Python interface, with libraries connecting molecular mechanics, entropy, and force field calculations.
  • Semantic work: GCF asks if PMR is doing this. OSCAR is part of this larger project. Need to examine what they are doing. Natural language processing is part of their work. It has not been overwhelmingly successful in general but may have a chance in chemistry. Bobbie from the Cambridge group will be knowledgeable about this.
  • Ajay wants PubChem and related ontologies. Rajarshi: Ajay has some set views, which may not match reviewers' biases.

  • Comments from Melanie

(See David's last slide on DTP).

  • First goal is to evaluate methods for data mining, second goal is to use the best method for discoveries. To do the first, must compare to a sample data set for validation.
  • GCF: must be able to state the improvement in the proposal, since this general activity is very common. Mookie: what are the chances we will get a paper out of this by end of year? Don't want to explain this in detail, only have 1/2 page in proposal. Must therefore have a paper to cite here. David: will have this submitted. Need a result to quote on how we improved chemistry in some way. Show how to open doors to new science. Very promising and interesting technique, but must be able to back it up. David: claim 1 is that we can scale up the size of problems that can be mined. GCF: not allowed to put computer science in this proposal unless it directly contains chemistry.
  • Mookie: if he were a reviewer, first question would be how do you know what genomic data to use? Faming: we provide guidance on the proteins. Mookie: sexiest part of work is already done. Docking workflows are already the most important. Run the risk of being too "ambitious" in proposal if discussing extending docking workflows to datamining.
  • Geoffrey: this is more likely to get good results in datamining than in natural language processing. We have more expertise in all parts: chemistry, data mining, high perf. computing.
  • Melanie: how to position for proposal? GCF: pour resources into this, get a really good result. If you get good results, we can make this a highlight of proposal, but must get a good chemistry result, not just a computer science result.
  • Mookie: need a clear target to achieve by December for this work. GCF: find something done badly in the past and improve it.
  • Melanie: we will adopt a traditional datamining algorithm and show how we can do much bigger problems than the current MATLAB/Excel-based work done by other groups.

  • Talk by Mookie

  • Q: Is Michigan a competitor? No, we are allies. NIH recommended we work together.
  • Our differences from other centers: we are the only ones with computational chemistry, education, and distributed systems expertise. Other groups are more chemical informatics focused.
  • We can tell NIH we are not a collection of R01 grants.
  • Comment by David: industry does workflows and services now (with Pipeline Pilot) but we are trying to be broader and open ended.
  • Mookie: PubChem fits as a piece in a workflow, federated with other databases. GCF: must demonstrate this, can't just state it in the proposal.
  • What do we do for a registry of services? Do both a wiki with text descriptions and UDDI. Mookie: also have a Taverna workflow associated with each service (or collection of services). Taverna workflows should also be described in text.
  • caBIG brought up. Enormous amount of work building ontologies and associated tools. Any links to this project would be good.

  • Big Red Demo

  • GCF: BR has several politically important implications. Must have good work done on it by SC06.
  • Jake: have workaround for Aug 22 OSCAR failures. Need to verify that this will work on all 2005-2006 abstracts.
  • Rajarshi: have 20 years' worth of abstracts, should do this next.
  • Jake: OSCAR also confused a space group with a molecule (c2). Shouldn't worry: later filters will catch these kinds of mistakes, so OSCAR doesn't need to be fixed.
  • SMILES to SDF conversion: Openbabel doesn't do this. OpenEye and Kevin's codes are the only ones that can do this.
  • How many unique SMILES? 6,000 for 1 year. Rajarshi: about 3,000 of these were in both PubMed and PubChem.
  • How many papers can we download from ACS (full text not abstracts)? Need 100 random papers to see the ratio of compounds in abstract to compounds in the paper. Mookie: can do 10-25 papers per day easily.
  • What is status of docking? OE code should run, but have not yet gotten the executable.
  • What is the next step for the proposal, SC06, etc.? The current version + docking will take < 1 day on BR. A small cluster would need only 1-2 weeks for the same run, which is not adequate.
  • We will use Jaguar as the QM code in the next step. Data then naturally goes into Varuna. Mookie and Kevin will start next week.
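
The counting questions raised above (unique SMILES, overlap between sources, and the abstract-to-full-text ratio over 100 random papers) could be sketched roughly as follows. This is a minimal sketch, not the group's actual code: the function names are hypothetical, and it assumes the SMILES strings were already canonicalized by a single toolkit, since plain string comparison cannot tell that two differently written SMILES describe the same molecule.

```python
import random


def unique_and_shared(extracted, reference):
    """Return (unique extracted SMILES, subset also found in reference).

    Assumption: both inputs hold canonical SMILES from the same toolkit,
    so exact string equality is a usable identity test.
    """
    unique = {s.strip() for s in extracted if s.strip()}
    known = {s.strip() for s in reference if s.strip()}
    return unique, unique & known


def sample_papers(paper_ids, k=100, seed=0):
    """Pick up to k distinct paper ids at random (seeded for repeatability)."""
    ids = sorted(set(paper_ids))
    return random.Random(seed).sample(ids, min(k, len(ids)))


def abstract_to_fulltext_ratio(counts):
    """counts: (compounds in abstract, compounds in full text) per paper."""
    counts = list(counts)
    in_abstract = sum(a for a, _ in counts)
    in_full_text = sum(f for _, f in counts)
    return in_abstract / in_full_text if in_full_text else float("nan")
```

The ratio is pooled over all sampled papers rather than averaged per paper, so papers with many compounds weigh more; a per-paper average would be an equally reasonable choice.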

  • Talk by Samy Meroueh

  • He is doing computational scoring of docking.
  • Workflow is the key: how to bring together components for docking, scoring, etc.
  • GCF: how do you best fit in with our project? A: can provide components for web services for scoring. Can extend the docking calculations to do more interesting scoring research.
  • Samy and David should write a summary "proposal" for the wiki.

  • Comments by Malika

  • Discusses her web interface work with students, which is also being used in other IUPUI work (drug discovery project). This work will be part of our education focus. Need to collect some metrics, numbers, testimonials for this work.

  • IUPUI presentation

  • Have tools for clustering of proteins, scoring, other tools. These can be web services.
  • Binding site prediction service would be one thing to do.
  • Other service: tertiary prediction service. Search literature for sequences as well as compounds.
  • Also, protein-protein docking.
  • GCF: write a proposal of work to do and put it on the wiki.

  • Talk by Rajarshi

  • Local PubChem: has ~5 SQL query types, which could easily be mapped to WSDL.
  • GCF: there is a HEP program called ROOT that does histograms, scatter plots, etc. Can have this as a web service. Would be an alternative to R.
  • Need to contact Kerby Shedden about combining his work with our R work (Rajarshi and Sima).

  • Talk by Sima

  • Also need an Excel to VOTable service. We have a VOTable to Excel converter.
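
The core of an Excel to VOTable service might look something like the sketch below. It is only an illustration under stated assumptions: the input is a CSV export standing in for an Excel sheet (the standard library has no Excel reader), the function name `csv_to_votable` is hypothetical, every column is declared as a variable-length string, and the VOTable XML namespace and column metadata (units, UCDs) are omitted.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_votable(csv_text):
    """Convert CSV text (header row first) into a minimal VOTable document."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("input has no header row")
    header, body = rows[0], rows[1:]

    votable = ET.Element("VOTABLE", version="1.1")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE")

    # Assumption: treat every column as a variable-length string.
    for name in header:
        ET.SubElement(table, "FIELD", name=name, datatype="char", arraysize="*")

    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in body:
        tr = ET.SubElement(tabledata, "TR")
        for cell in row:
            ET.SubElement(tr, "TD").text = cell

    return ET.tostring(votable, encoding="unicode")
```

A real service would also need to infer numeric column types and wrap this function behind a web service interface; neither is attempted here.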

  • Talk by Jake Chen, IUPUI

  • How do we follow up? Need to see the overall plan of the project.