NSF Cyber-Infrastructure for Chemistry Meeting

From Chemical Informatics and Cyberinfrastructure Collaboratory

Main Session Notes

Marlon's Notes from the workshop described at http://www.oit.ucla.edu/nsfci.


NSF Chemistry and Biology CI Pre-Meeting Musings


(Some notes I took on the plane)

  • Need to develop lots of services. Difficulty is scaling, not sophistication. Simple systems become complicated when scaled to large sizes.
  • NSF should mandate that all online data sources it supports have web service (or REST equivalent) interfaces. Must avoid monolithic system design--monolithic systems may be implemented (like NIH PubChem), but they should be easy to modularize and extend.
  • Must avoid building tightly bound desktop applications. All applications should have separable GUIs and engines.
  • Possible activity, "Build lots of cheminformatics descriptor services".
  • Should emphasize services that don't have sophisticated security requirements. This dramatically simplifies things.
  • HPC: should try to identify things that can be done by experts but with publicly available results. HPC will never be for the masses, best to use HPC resources in a super-SETI mode: set up calculations to be run beforehand, calculate results and make publicly available.
  • Keep It Simple: Best to avoid super-complicated XML formats developed by sequestered experts. Use community best practice instead.
  • Useful activity would be to define some message formats for WSDL: keep the WSDL simple and put sophistication in messages.
  • Online community building: what are the requirements here? What would people like to be able to do? This can be both synchronous and asynchronous.
  • How much work has been done on data warehousing?
  • Does NSF have mandate to build Cheminformatics deployment CI (as the TG is a grid deployment)? This is a long term commitment.
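
As a rough illustration of the points above about separable engines, simple interfaces, and descriptor services, here is a minimal Python sketch. Everything in it is hypothetical: the `molar_mass` engine, the `handle_request` wrapper, and the JSON message layout are assumptions made for illustration, not something presented at the workshop. The atomic masses are rounded standard values for a handful of elements, and only flat formulas (no parentheses) are handled.

```python
import json
import re

# Rounded standard atomic masses (g/mol) for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

# One element symbol followed by an optional count, e.g. "C2" or "O".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def molar_mass(formula: str) -> float:
    """Engine: molar mass of a flat formula such as 'C2H6O'.

    Knows nothing about transport, messages, or user interfaces.
    """
    if not formula:
        raise ValueError("empty formula")
    total = 0.0
    pos = 0
    while pos < len(formula):
        match = _TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"cannot parse formula at position {pos}")
        symbol, count = match.group(1), match.group(2)
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"unsupported element: {symbol}")
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
        pos = match.end()
    return total


# Registry of available descriptors; new ones are added without
# changing the service interface.
DESCRIPTORS = {"molar_mass": molar_mass}


def handle_request(message: str) -> str:
    """Interface: one simple operation; the detail lives in the message."""
    try:
        request = json.loads(message)
        func = DESCRIPTORS[request["descriptor"]]
        value = func(request["formula"])
    except (KeyError, TypeError, ValueError) as err:
        return json.dumps({"status": "error", "reason": str(err)})
    return json.dumps(
        {"status": "ok", "descriptor": request["descriptor"], "value": round(value, 3)}
    )


if __name__ == "__main__":
    print(handle_request('{"descriptor": "molar_mass", "formula": "C2H6O"}'))
```

The same `handle_request` function could sit behind a REST endpoint, a SOAP operation, or a command-line tool without touching the engine, which is the separation argued for above.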

Welcome session notes


  • About 50 people. NSF CBET community connected to CI community is the goal.
  • Not a lot of CS or grid research. SDSC, TACC, me. Mostly chem/mech/petroleum/elec engineering academics and industry folks. Various NSF folks. Not so many biologists? Also, this is not really NIH style cheminformatics. "Industry" here is not Lilly, but Dow and Dupont. More chemical engineering. Also, relatively heavy on industry. Lots of discussion on process engineering, "smart plants", supply chain management. Probably not enough people at meeting who know anything about web services, Globus, Condor, etc.

(Talk by Maria Burka from NSF)

  • NSF org chart seems to have some redundancy. CISE and CI, for example.
  • EFRI: expects to give 15-30 $1M awards in collaborative research. Letters of intent due in October. This is CI-like (dynamically reconfigurable distributed infrastructure).
  • What is the Grand Challenge? What are the gaps? Have to work with industry. Have to be careful to avoid conflicts with industry.

(Bruce Hamilton, OCI)

  • Some discussion of the TG. Campus layer is the base of the pyramid. "Campus grids" referred to by Catlett?

Talk by Sangtae Kim


  • Dynamic data, bidirectional data flow. DDDAS
  • Some mumblings about "flat world" economics and analogies to CI.
  • Problem with pharma: breakthroughs in therapeutics are not usable because of 1/1000 adverse side effects. Doesn't scale to billions of people, since millions will have major side effects.
  • Kim is from Eli Lilly. He mentions entire budget for middleware was 10M, advised by IBM to shut it down. Instead, doubled the budget. Amusing opinion of IBM. Lilly alone had $50M/year budget just for itself.
  • Success stories in predictive modeling of side effects?
  • Transition in energy is an example of a disruptive technology. CI will be important in this transition. Why?
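
The scaling argument about a 1/1000 adverse-effect rate is simple arithmetic, sketched below in Python. The function name and the population sizes are illustrative assumptions; only the 1-in-1000 rate comes from the talk.

```python
def expected_adverse_cases(population: int, one_in: int = 1000) -> float:
    """Expected number of patients with a major side effect,
    assuming each patient is affected with probability 1/one_in."""
    return population / one_in


if __name__ == "__main__":
    # Hypothetical patient populations, from a small targeted group
    # up to a worldwide market.
    for population in (10_000, 1_000_000, 1_000_000_000):
        cases = expected_adverse_cases(population)
        print(f"{population:>13,d} patients -> about {cases:,.0f} adverse cases")
```

At a billion patients the same rate yields on the order of a million serious cases, which is why a drug that looks acceptable for a small group does not scale to a mass market.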

Kirk Jordan, Advanced Computing at IBM


  • Talk mostly about Systems Biology (quantitative biology).
  • Modeling requires validation before it can be used by non-experts. Validation is a dull but important activity.
  • Focus on CI as a HPC activity. "We can now do overnight what formerly took 12 years." Not the full picture.
  • This is mostly a discussion of IBM BlueGene.
  • Mentions multicore, challenge is to do parallel programming on a chip instead of across many chips: I should see if anyone in the meeting does stuff in this area.
  • Comment: Implying that NSF should fund big machines. A: No, vendors already are taking a loss. Vendors are more interested in novel architectures. Scientists often drive this. Use application end users early to get requirements for chip design.
  • Stan's Comment: need to advise NSF on how to spend the money. $300M for hardware but not much for software, application research, humans, etc.
  • Jay: Just about to have more proposals for applications, etc from NSF. Not $300M worth of people to do petascale computing currently. Have a pipeline problem.
  • TG comment: the science gateways are working to do this.

Jim Porter, Dupont


  • This is mostly a managerial/business talk. Porter is primarily a lab-designer and lab-builder (or plant builder) for Dupont. Interoperability is a problem in facilities (intra and inter). This leads to waste.
  • Many 50+ year old people, who soon will be retiring. Need to keep this collective expertise online somehow.
  • Looks like they are doing some "home grown" CI. Need middleware for managing the data and information flow within their plants.
  • Mentions data security. Also, network security is important with online systems.
  • Q: what about the research side of things (discovery, etc)? How do you start the pipeline? They partner with future customers to do R+D.
  • Q: how to capture knowledge of aging employees? They are "downloading" employees before they retire. Not really specific on details. This is a classic CS AI problem of expert systems. They have a protocol for questioning the employees--a lot easier to pick people's brains as they retire.
  • Comment: data management is more important than computation.

Talk by Heinrich Braun, SAP


  • Supply chain management problems for internationally distributed supply chains.
  • Optimization and scheduling problems.
  • Calls parallel programming "grid computing"
  • "Local search agents" for parallel computing (?)--this is basically scheduling and dynamic resource binding, to bring computing resources to an assembly as required by a particular problem. Condor, PBS like problems?
  • Blade racks and pc clusters.
  • Mentions SOA, "information hiding"
  • This could be an impressive system.
  • Q: what recommendations do you have for RFID architecture? A: mostly focus on optimization work, so no comments (very German).
  • Comment: Have been arguing against deterministic modeling and more for stochastic models. SAP is using a deterministic model.
  • Q: challenge with large decision models is extracting useful information. What is SAP's approach?
  • Q: Can you discuss challenges and gaps in scheduling in supply chains? A: Midterm planning uses linear optimization. Midterm means daily or weekly periods. Short term means greater time precision. Many customers have problems with two-layer optimization.

Talk by Greg McRae, MIT


(This was a good talk, McRae seems to really know what is going on)

  • Talk is a meta-talk on how to do an effective workshop.
  • Main message is that chem eng needs to have better collaboration, holds up other fields (astronomy, genetics) as having better collaboration infrastructure.
  • PITAC report given as a good example: uses lots of examples.
  • Mentions only 35% of oil actually extracted.
  • Calls Top500 the "bane of the community". Growing separation of peak operations versus actual application performance.
  • Power consumption of large machines is also a problem.
  • "Earthmate" portable sensor
  • Advocates more funding for high performance algorithms, less for hardware.
  • Advocates multiscale computing, but this requires data integration of commercial applications (Gaussian -> tarcd or Fluent CFD -> etc.).
  • Mentions problem solving environments: workflow, notebooks, literature managers, etc.
  • Mentions community knowledge tools, collaboration etc (skype, google, wikipedia).
  • Issue for integrating these community tools is that they need programming interfaces and backend services.
  • Mentions need for agility.
  • Comment: data standardization and curation is a problem since it requires a long term commitment. Problem is that NSF itself does not have the infrastructure. Other parts of the government do (digital libraries are an example).
  • Q: how to teach new students? A: At MIT, teach software engineering (with MATLAB) and not just programming. Extreme Programming and Code Complete texts are used.

Talk by Jerry Gipson, Dow Chemical


  • Industrial requirements and dilemmas. But what is industry's role in the NSF? I suppose industry partners are needed to a) avoid duplication and competition with academia, and b) show benefit to the tax payer.
  • Dow is moving to software providers (buy instead of build). Note also Dupont has same general strategy. The J&J solution of build all at home is unusual.
  • Security is important. Authentication of devices, not just people. Role based authentication (or really authorization?). No unknown devices allowed in the system.
  • Mentions SOA. Future plans.

Talk by Jay Boisseau, TACC


  • Focus on TeraGrid.
  • Tries to emphasize broader viewpoint than just HPC. TG is also about science gateways to simplify access and about open architectures (Globus, etc).
  • Mentions multicore. Probably good to assume that all the major computing centers are going to have some credibility here.
  • TG campus partners: underused CPUs on desktops, unique instruments. Their identity authentication infrastructure is the future of TG security: let campuses do this instead of TG.
  • Points out industrial partnerships: use TG as an external source for HPC to companies.
  • Not much funding going into programming tools, languages, and libraries. Should consider remedying this.
  • Comment: chemeng is under-represented in the academic HPC world. How to remedy? Most chem-eng stuff is done on PC still. So it will be important to simplify the access to engineering applications on the TG.

Afternoon session I


  • See other notes.
  • What are the community services that need to be built?
  • What are the requirements for real time data? Sensors, etc? How well does this map to SOA?
  • It is going to be really hard to filter all of these discussions.
  • Herceptin is a personalized drug. Increasing efficiency in pharma will help them target smaller communities. Note this would help the 99.9% problem (i.e., a 0.1% rate of serious side effects matters less in smaller groups but doesn't scale to large populations).
  • Issue will be to define reasonable services.
  • Stan mentions DARPA has an effort on building the next generation of HPC language (or languages?).
  • 80% of cost in pharma is phase III clinical trials.
  • Also, variability of level of active ingredients in drugs can be a greater problem than the susceptibility of 0.1% to a drug.

Afternoon session II


  • See other notes



Tuesday Morning Session


(Comments by Jim Davis)

  • Opportunity for Chem Eng CI. NSF CBET Community is looking to work with CI.
  • "Portfolio Concept"
  • Need to tie to TeraGrid
  • Chem eng connection to industry is a hindrance to adopting CI.
  • Stan: "supercomputing is not necessarily heroic." Also, close connections with industry are a good thing--will create closer connection between industry and CI.
  • NSF funds things 10-15 years away from industry adoption.
  • Industry partnerships important for the TeraGrid. Not all groups will want to do this full time. This is untapped.
  • Power consumption of clusters will drive smaller groups to stop building large clusters. Universities need to plan. Can't just have unplanned clusters popping up. This drives the importance of multicore, btw.
  • Comment: how do you explain the benefits of HPC to non-HPC people in concrete ways. For weather simulation, for example, what can you do with one PC, what kind of problem can you do with 10 nodes, 100 nodes, etc. Is this a metric?
  • Weather.com example: don't need a supercomputer to use weather.com, but weather.com needs a supercomputer

Breakout II recap


(These are mostly summarized by slides)

  • Economic impacts: is there need for computational outreach to smaller companies that don't have the necessary expertise? Or do software vendors already do this? Or do small company computational experts already contract with small company lab-driven groups?
  • Need to make a TeraGrid presentation at some Chem Eng professional meetings.

AIChE is one group. ISA is a subgroup of this (?) (information systems automation) that is a good industry connection for CI. They also do training and education.

  • Sensor networks are important to this group in general (process control, for example).
  • Education: CI software could be used in advanced undergrad and grad courses. But the Chem Eng faculty will not be qualified to teach it. So will need to couple with local CS department or perhaps do this through distance learning.

Tuesday morning session wrap up


These are points from an "op-ed" by Vince Grassi.

  • State of the union: CI will happen. OCI will spend $1B, so should participate. Also, CBET has a lot of problems, so these should be put in CI terms.
  • Call to action: should organize problems around themes: loose versus close integration, smart plants, modeling, information modeling, large scale planning. Need a portfolio of themes: identify a few megaproblems as well as smaller problems. Finally, need to consider forming subteams to work on portfolio problems.
  • Think differently: "Build it and they will come" WILL work in this case. Also, the community is missing the "soft skills" of marketing themselves.

Session IV Notes


  • TG is a good forum for building collaborative communities. Need to explore technologies to allow emergent science communities.



Session IV Breakout Notes

  • Generally, what do we want to tell the NSF? How does CBET connect to OCI?
  • Message from NSF: think big, consider mega problems that need millions of dollars in funding to solve. NSF is accountable to Congress, so have to come up with challenging problems.
  • Enabling multiscale problems:

- must word in terms of national priority?

  • Advance scientific knowledge on problems that are hard for industry to solve.

- Specific problem: US supply chain of chemicals and petroleum.
- Possible angle: homeland security. Can CI provide protection and redundancy in case of natural, man-made, and cyber disasters? Where should chemical plants be located? Within US? Outside US? Research on making cleaner, safer plants.

Q1: Should there be a CBET CI community?
- Yes
- Problem is to identify the research problem, high end areas.
- Presumably there is a bootstrapping process: ultimately need to identify general areas for future calls for proposals.
- Industry represented by ISA, AIChE, Council for Chemical Research. These are good groups for making CI communities.

  • DDDAS group from CISE is umbrella group for coupling dynamic data with computational research.

Q2: What are most promising enabling areas?
- What really is multi-scale? From industry: simulations at low fidelity feed into higher fidelity system.
- Chem eng has to deal with the entire scale, from molecules to chemical plants.
- System level integration
- Need to map multiscale problems back to what CI can do now. Most CI middleware work has been to support "loosely coupled" connections of services into workflows. Some of these problems exist for the CBET community and can (possibly) be tackled with current CI tools. Multiscale can also be tightly coupled, which doesn't map to current CI software research into web services and workflow, so need to get these requirements back to the CI community and collaborate with it.

  • Sensor grids and networks:

- Sensor grids are currently too expensive for complete deployment. Labor intensive to install, must be gorilla proof.

Rank Order List of Problems with the Smart Plant


  • Sensors and sensor networks: crucial for safety of equipment and optimization of processes. But currently too expensive. Need to be either cheaper or more mobile. Research challenges: physics of sensors to make smaller and cheaper; make wireless networks work reliably and securely in metal factories; resolve international interoperability problems. Data management: massive increase in raw data into modeling and control systems, so need research into new algorithms for data processing. Also, detecting pre-cursors to abnormalities--early event detection. Fuzzy detection: "something may be happening, perhaps operator/human brain should investigate" but risk of false positives. Human training and feedback important--different operators have different capabilities and experience ("context problem").
  • Cyber-security

- Must be able to address response as well as initial security.
- Important for industry given the potential physical danger from security breakdowns.
- Sensors are an important part of cyber security: detection of break-ins.
- But sensors are too expensive still.
- Can't hamper operators in emergencies, but must also provide security in general. Must balance between these conflicting requirements.
- Who works on sensors in CBET community? Mostly vendors.

  • Interoperability
  • Software quality: software engineering important to CI, gets more difficult as you scale up to large grids. Also, CI for operating systems is difficult. Real world example: MS patches must be applied to 24/7 systems. CI software robustness, reliability, dynamic maintainability are important to the CBET fields.
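
The "early event detection" idea from the sensor item above can be illustrated with a very simple rolling z-score check. This is only a sketch of one common technique, not a method proposed at the workshop; the function name, the window size, and the threshold are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def flag_precursors(readings, window=20, threshold=3.0):
    """Yield (index, value) for readings that deviate from the mean of the
    previous `window` readings by more than `threshold` standard deviations.

    A flagged reading only means "something may be happening"; an operator
    still has to decide whether it is a real abnormality.
    """
    history = deque(maxlen=window)
    for index, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip the test when the recent signal is perfectly flat.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield index, value
        history.append(value)


if __name__ == "__main__":
    # Hypothetical temperature trace: steady noise, then one spike.
    trace = [10.0, 10.1, 9.9, 10.0] * 5 + [10.05, 14.0, 10.0]
    for index, value in flag_precursors(trace):
        print(f"reading {index} looks abnormal: {value}")
```

Lowering `threshold` catches precursors earlier but raises the false positive rate, which is the trade-off noted in the sensor discussion.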

APPLICATIONS

  • Application: Bio-molecular systems analysis is a multiscale problem, modeling from proteins to cell structures to cells to multiple cells to .....
  • Application: beta of natural gas is very large (i.e., high price fluctuations); consider the results of Katrina and Rita. Supply chains are very rigid and fragile, and so susceptible to breakdowns. This whole system is worth modeling.
  • Application: Smart plants. These are smart supply chains at smaller scales, so can be modeled as a research challenge. But they also present hazards not found in the general supply chain problem (i.e., things blow up, hazardous chemicals are released, etc.).
  • Application: can you simulate plants to avoid building pilot plants? Virtual pilot plants. Much harder for chem industry than traditional manufacturing. Problems occur when you move away from simple chemicals. Gore-Tex, for example, has never been modeled; it could only be done with pilot plants. Nonlinearity and non-scalability when moving from test tubes to production plants is a grand challenge for building virtual pilot plants. Need fundamental breakthroughs. Industry uses art instead of science. Science is 50 years behind art in some cases.
  • No time to discuss, but important: energy, conservation of resources, less waste generation, less environmental impact. This is also part of supply chain systems research.
  • Zero incidents: this is part of the smart plants research challenge area; important to integrate sensors.