GuhaTriptoScripps

From Chemical Informatics and Cyberinfrastructure Collaboratory

There were two main goals for the trip: first, to make plans for collaborating on the toxicity data generated for the MLSCN compounds, and second, to look at things we can work on in terms of WS's and infrastructure.

One of the questions that I went with was: what is Scripps expecting from the collaboration? It turns out that Scripps is an MLSCN - and this includes Scripps, FL (which has the informatics and robotics) and Scripps, CA (which has the chemistry). Furthermore, it seems that Brad Ozenberger and Ajay (?) pointed out to Scripps, FL that collaborations with ECCR's and other MLSCN's would be beneficial. He also mentioned something along the lines that NIH expected to see collaborations - but I'm not exactly sure about this aspect. Another aspect wrt collaborations was that Stephan does not think that purely focusing on descriptor and algorithm development is of much use. He would rather work with tools and workflows to get information out of data - especially since a lot of new (MLSCN) data is being generated and has not been studied.

So this explains why Scripps, FL would be interested in collaborating.

Regarding plans for the collaborative work:

1. Data mining/modeling: Stephan has access to the Toxnet database (100K compounds, with LD_{50}'s and structures), the Leadscope tox DB, the MLSCN dataset (~66K) and a cytotox dataset (3K, secondary screen data). We will most probably be focussing on the MLSCN dataset (secondary screening data is expected sometime soon), though the cytotox data is available if we want it. There are two lines of investigation:

     * Summarize/review the MLSCN tox data (which is primary HTS data)
       with respect to the larger DB's. This would include stuff like
       looking at the distributions of compounds in fingerprint and/or
       descriptor (BCUT) spaces, clustering with a view to finding
       compound classes/scaffolds, initial crude Bayesian models to get
       an estimate of toxicity, categorization of the ToxNet/MLSCN
       datasets (thus performing classification rather than regression
       - initial results with a RF model indicate that classification
       seems to work pretty well), and looking for fragments that are
       indicative of toxicity (a starting point is a subset of the
       Leadscope fragments). He'd also like the fragment list from
       ToxTree. Also, an initial approach using kNN seems to work better
       than PLS regression etc. Look into this, using different metrics
       (dist, fingerprint similarity etc) and consider RNN as opposed
       to absolute NN's - using different cutoffs might allow us to
       provide some sort of confidence/reliability in predictions. Also
       look at stuff that is in the MLSCN but not in Toxnet. We also
       need to look at cutoff values when trying to categorize the
       datasets or when performing nearest neighbor predictions.
     * Build more sophisticated predictive models (RF, SVM etc). Also
       consider the use of tox indicating fragments described above as
       a measure of model applicability. Also look at decision trees to
       derive 'lines of reasoning' for why compounds are predicted as
       toxic
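
To make the RNN idea above concrete, here is a minimal sketch of a cutoff-based nearest neighbor prediction. It is only an illustration under assumptions of mine, not something we agreed on: fingerprints are represented as sets of on-bit indices, similarity is Tanimoto, the vote is a simple majority, and the names `rnn_predict`, `cutoff` and `min_neighbors` are hypothetical. The number of neighbors inside the cutoff is used as a crude reliability flag.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rnn_predict(query, training, cutoff=0.7, min_neighbors=3):
    """Majority-vote prediction from all training compounds within the cutoff.

    training is a list of (fingerprint, label) pairs (hypothetical layout).
    Returns (label, n_neighbors, reliable). label is None when no training
    compound passes the cutoff; reliable is True only when at least
    min_neighbors compounds were found.
    """
    neighbors = [label for fp, label in training if tanimoto(query, fp) >= cutoff]
    if not neighbors:
        return None, 0, False
    label, _ = Counter(neighbors).most_common(1)[0]
    return label, len(neighbors), len(neighbors) >= min_neighbors


# Toy data, made up purely for illustration
training = [
    ({1, 2, 3, 4}, "toxic"),
    ({1, 2, 3, 5}, "toxic"),
    ({1, 2, 3, 4, 5}, "toxic"),
    ({7, 8, 9}, "nontoxic"),
    ({7, 8, 10}, "nontoxic"),
]

loose = rnn_predict({1, 2, 3, 4}, training, cutoff=0.5)
strict = rnn_predict({1, 2, 3, 4}, training, cutoff=0.7)
novel = rnn_predict({20, 21}, training, cutoff=0.5)
```

Tightening the cutoff shrinks the neighborhood, so the same query can go from a flagged-reliable prediction to an unreliable one, and a compound unlike anything in the training set gets no prediction at all.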

(One important aspect is that he stressed chemical information as the end result of the modeling. So he'd like, in the end, to see stuff like toxicity flags based on substructures, explanations of why a compound is toxic and so on.)

One interesting outcome of the review, as well as of the more sophisticated modeling process, might be the ability to suggest alternative assays. That is, say we are performing an RNN prediction, but the cutoff only leads to a few (or zero) neighbors - such a prediction is probably not reliable. It might be useful to suggest an alternative assay that could be used to perform the prediction. This will require measuring correlations of tox values for a set of compounds over a variety of assays - at this point I'm not sure how many such compounds there will/would be, but this is probably a little more long term compared to the review/modeling described above.
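
A rough sketch of how the assay correlation step could look, assuming (my assumption, not anything decided with Stephan) that each assay is available as a mapping from compound ID to tox value and that Pearson correlation over the shared compounds is an acceptable first measure. The names `suggest_alternative` and `min_shared`, and the toy data, are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def suggest_alternative(assays, target, min_shared=3):
    """Return (assay name, r, n shared compounds) for the assay whose values
    correlate most strongly (by |r|) with the target assay, or None.

    assays maps assay name -> {compound id: tox value}. Assays sharing fewer
    than min_shared compounds with the target are skipped.
    """
    best = None
    for name, values in assays.items():
        if name == target:
            continue
        shared = sorted(set(values) & set(assays[target]))
        if len(shared) < min_shared:
            continue
        r = pearson([assays[target][c] for c in shared],
                    [values[c] for c in shared])
        if best is None or abs(r) > abs(best[1]):
            best = (name, r, len(shared))
    return best


# Toy values, made up purely for illustration
assays = {
    "assay_A": {"c1": 1.0, "c2": 2.0, "c3": 3.0, "c4": 4.0},
    "assay_B": {"c1": 2.1, "c2": 3.9, "c3": 6.2, "c4": 8.0},
    "assay_C": {"c1": 3.0, "c2": 1.0, "c3": 4.0, "c4": 2.0},
    "assay_D": {"c1": 1.0, "c2": 2.0},
}
suggestion = suggest_alternative(assays, "assay_A")
```

The `min_shared` guard matters here because, as noted above, it is not clear how many compounds will have values in more than one assay.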

With regard to access to data, Stephan says it should not be a problem for him to give us the data dumps from Toxnet etc, as long as we don't distribute them.

Regarding tools to do this - some of the stuff he'll have to do, as Spotfire is very slick for the 100K molecule dataset manipulations. So stuff like the initial clustering to get a set of 'compound classes' will have to be done in Spotfire (though we can also do it with parallel k-means and see what we get out of that). However, other stuff like looking at RF/SVM models, kNN/RNN protocols etc can be done at IU.

[ Also since Pipeline Pilot uses R, it would be interesting to see if we build an R model at IU whether Pipeline Pilot can do anything with that (or the reverse - export an R model from Pipeline Pilot) ]

2. Web services - we looked at getting Pipeline Pilot to hook into our WS's via the published WSDL. It works quite easily and well. We got the molecular weight, formula and Toxtree services running as Pipeline Pilot components and Stephan liked it :)

In this area the scope of collaboration does not seem to be as broad as in the modeling area. I asked what type of services Scripps would be interested in hooking into their workflows. Stephan suggested the following:

     * OSCAR services, oriented towards patents (though the idea of an
       OSCAR parsed abstract DB was also good). Stephan would be very
       interested in this and could also help out if required. He noted
       that free (or minimal charge) patent DB's are available which we
       could process. At the same time, Stephan's opinion is that this
       is a somewhat longer term project.
     * Toxicity services - ToxTree is one example, others would include
       alternative predictive model services etc, which leads to the
       next service: 
     * A modeling service - essentially send a Y vector and X matrix,
       and run it through technique Z and get back predictions. He
       thinks this would be very useful. Another specific modeling
       service he mentioned was a feature selection service.
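
As a rough sketch of what the 'send a Y vector and X matrix, run technique Z, get back predictions' service could look like on the inside (ignoring the WS transport entirely): the JSON field names (`X`, `y`, `X_new`, `technique`), the `TECHNIQUES` registry and the pure-Python OLS below are all my own assumptions for illustration. The real service would presumably hand the fitting off to R.

```python
import json


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; X columns may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(X, y):
    """Fit OLS with an intercept via the normal equations; return a predictor."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return lambda new: [beta[0] + sum(b * v for b, v in zip(beta[1:], r))
                        for r in new]


# Hypothetical registry: 'technique Z' is looked up by name
TECHNIQUES = {"ols": fit_ols}


def handle_request(body):
    """Take a JSON request string and return a JSON response string."""
    req = json.loads(body)
    fit = TECHNIQUES.get(req["technique"])
    if fit is None:
        return json.dumps({"error": "unknown technique"})
    predict = fit(req["X"], req["y"])
    return json.dumps({"predictions": predict(req["X_new"])})


# Toy request where y = 1 + 2*x1 + 3*x2 exactly
request = json.dumps({
    "technique": "ols",
    "X": [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]],
    "y": [1, 3, 4, 6, 8],
    "X_new": [[2, 2]],
})
response = json.loads(handle_request(request))
```

Keeping the techniques in a lookup table means that adding LDA, or a feature selection step, would only need a new entry rather than a new service interface.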


So the above three points suggest that we should start on building up the PubMed abstract/SMILES database for multiple years, as well as start looking at the patent side of things. I will also get started on the R web service - an initial service (OLS or LDA) should be fast enough to set up.

Stephan (as well as some other people here) is curious as to how they can 'discover' available WS's. One person mentioned BioMoby ( http://www.biomoby.org/ ) as a resource for bioinformatics services - there is no comparable service for cheminformatics (not surprising, given the small number of cheminformatics WS's). I pointed out that we have a list of services at IU as well as at some other places. It might be useful for us to 'publicize' ourselves as the central point for cheminformatics WS's - if people provide WS's elsewhere we could maintain a catalog. If we could extend this catalog from a static HTML page to something like a UDDI server, that might be very useful, though I'm not sure how feasible this is. It also might be useful to take a look at BioMoby and either supply our list to them, or see if we can set it up ourselves and focus just on cheminformatics WS's.

I also think we need to enhance our service offering with a lot of small 'atomic' services, which could be as simple as exposing a number of CDK classes (hydrogen addition, hydrogen removal, ring counts etc).

On the issue of wrapping stuff on the Scripps side as WS's, he thinks it's possible, but since they more or less do everything within Pipeline Pilot and Spotfire, he doesn't exactly know the mechanics of how to set this up. More importantly, he'd rather wait till he has some good predictive tox models in PP/Spotfire before actually starting to wrap them up as WS's. Given models, he'd be willing to set up WS's on his side - though how we'd go about helping with PP/Spotfire I'm not sure (they're pretty expensive IIRC).

I mentioned that I'd be starting a wiki page for the collaboration, and Stephan agreed that was a good idea. Should I just start a page from somewhere on the Chembiogrid wiki? I'll provide a more organized version of the modeling and WS work on the wiki page.

So overall, it looks like Stephan would like to work on modeling/characterizing tox data in the short term and get something out. He is also quite interested in the WS stuff (especially OSCAR wrt patents), but considers that a slightly longer-term effort.