Model Exchange
From Chemical Informatics and Cyberinfrastructure Collaboratory
Goals
The goal of this project is to provide a standardized and cross-platform document format for the exchange of models. In this context, a model is generally defined as a function that accepts input data and a set of parameters and will provide a prediction (of arbitrary type). This is a very abstract definition but takes into account a variety of things that one can call a model.
The most familiar model would be a statistical model (such as a QSAR model). Examples of these models include linear regression, neural networks and so on. The aim of these models is to take in a set of descriptors (real-valued or categorical) and predict one or more properties (which may also be real or categorical). Note that such models are fully specified by the parameters of the model. Thus for a linear regression model, one need only specify the regression coefficients to be able to rebuild the model. In the case of a neural network, one must specify all the weights and biases as well as the type of transfer function to fully specify the model.
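To make this concrete, the following minimal sketch rebuilds a linear regression model from nothing but its parameters. The names `LinearModelSpec` and `predict` are hypothetical, introduced only for illustration, and the coefficient values are arbitrary.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LinearModelSpec:
    """Hypothetical container: the parameters that fully specify a linear model."""
    intercept: float
    coefficients: Dict[str, float]  # descriptor name -> regression coefficient


def predict(spec: LinearModelSpec, descriptors: Dict[str, float]) -> float:
    """Evaluate the model on one observation using only the stored parameters."""
    return spec.intercept + sum(
        coef * descriptors[name] for name, coef in spec.coefficients.items()
    )


# Arbitrary illustrative values, not taken from any real QSAR model.
spec = LinearModelSpec(intercept=1.0, coefficients={"X1": 2.0, "X2": -0.5})
prediction = predict(spec, {"X1": 3.0, "X2": 4.0})
```

A neural network would need a larger specification (weights, biases and transfer function), but the principle is the same: the specification alone is enough to regenerate the predictor.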
On the other hand, one can consider a docking model. At first sight, this seems to be different from a statistical model, but fundamentally they do the same thing. A docking model will accept a protein, its binding site and a ligand, along with the parameters required by the docking program, and will predict how the ligand docks into the protein. Thus the model is fully defined by the protein binding site and the arguments for the docking program. Its input would be a ligand structure and its output would be a (set of) ligand poses.
The characteristic feature of model exchange is that it will involve the exchange of model specifications. That is, any proposed document format will simply be a description of the model. On its own the document can do nothing more. It is the responsibility of the consumer of such a document to actually use the included parameters to rebuild the model in its local environment. Thus if we consider the exchange of an OLS model, a consumer such as R would load the document and generate an object of class "lm" using the coefficients stored in the document. In the case of a docking model, one would essentially have to have the docking program itself to regenerate the model - using the document as a parameter input file.
Thus model exchange will occur on the level of model specifications and not model implementations.
Possible Approaches
Given that conceptually one can group many things under the 'model' umbrella, it is clear that any document format that is considered will have to be very generic. However, although the types of data required to specify a statistical model and a docking model may be similar, the semantics of the parameters will differ significantly.
Given this situation, an XML-based document would be attractive for a number of reasons, the primary one being extensibility. Rather than defining a single document format that tries to encompass everything, the proposed format would simply define a set of model types (say, StatisticalModel, DockingModel), and the description of the individual model types would be defined in terms of totally separate XML documents. That is, the proposed document format would be a container, allowing us to reuse current formats (or create new formats) for models from different domains.
More generally, one need not even specify a set of model types, but rather just include namespaces for schemas that are defined for models from different domains. Thus one might include the namespace for a statistical model schema and a namespace for a docking model schema.
A good example of this approach is that of RSS feeds. These are XML documents whose schema defines a few elements, allowing us to specify items in a feed. However, each item can include arbitrary XML documents via namespaces. One example of such an "extended" RSS document is a CML-RSS feed, which embeds viewable chemical structures (using CML) in an RSS feed.
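The container idea can be sketched as follows. Every element name and namespace URI in this example is a hypothetical placeholder (no such schemas are defined in this document); the point is only that a consumer can inspect the namespace of each embedded description and dispatch to the appropriate domain-specific handler.

```python
import xml.etree.ElementTree as ET

# All element names and namespace URIs below are hypothetical placeholders.
CONTAINER_XML = """\
<models xmlns="http://example.org/model-exchange"
        xmlns:stat="http://example.org/statistical-model"
        xmlns:dock="http://example.org/docking-model">
  <model name="logP-regression">
    <stat:LinearRegression intercept="1.0"/>
  </model>
  <model name="kinase-docking">
    <dock:DockingRun program="some-docking-program"/>
  </model>
</models>
"""

NS = {"mx": "http://example.org/model-exchange"}


def payload_namespaces(xml_text: str) -> dict:
    """Map each model name to the namespace URI of its embedded description."""
    root = ET.fromstring(xml_text)
    result = {}
    for model in root.findall("mx:model", NS):
        payload = model[0]  # first child is the domain-specific description
        # ElementTree writes qualified tags as "{namespace-uri}localname".
        uri = payload.tag[1:].split("}", 1)[0]
        result[model.get("name")] = uri
    return result


namespaces = payload_namespaces(CONTAINER_XML)
```

The container itself knows nothing about regression or docking; it only carries the descriptions, which is what makes the format extensible.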
In the following sections we describe possible formats for models from different domains.
Document Formats for Subdomains
Statistical Models
One possible document format for the exchange of statistical models is Predictive Model Markup Language (PMML). The format is based on XML and is designed by the Data Mining Group (DMG) which is a consortium of industrial (Microsoft, IBM, Oracle, SAS) and academic (U. Illinois) groups. The schema for the format is publicly available.
Overview
Briefly, PMML allows one to serialize a statistical model (of various types) to an XML document such that the document is a full description of the parameters and data used to build the model. Thus one can describe the variables used to build the model in terms of names, types (real, categorical), ranges (minimum, maximum values) as well as include the actual training data used to build the model (for the purposes of model verification).
A brief summary of the benefits of PMML described by Grossman et al. (www-users.cs.umn.edu/~kumar/Presentation/M7-dm-chap10.pdf) is listed below:
- Open standard for data mining and statistical models
- Not concerned with the process of building a model. Rather it describes what is required to build the model, given the input data
- Provides independence from applications and operating systems
- Simplifies the use of statistical models by other applications
- Allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations.
In addition to the above, because PMML is an XML-based markup language with a well-defined schema, a number of other benefits are available:
- PMML documents can be checked for conformance
- Since robust and efficient XML parsers are available for a variety of languages, parsing a PMML document is not a significant problem
- PMML documents can be transformed using XSL to other formats such as HTML.
However, it appears that due to the way the PMML namespace is implemented, one cannot include fragments using other namespaces within a PMML document. Thus one could not include, for example, a chemical structure in CML.
PMML Features
A PMML document can be divided into a number of sections such as header, data dictionary and so on. Rather than provide a detailed description of individual sections the reader is referred to the specification. In the following text I provide a brief summary of features that are relevant for the current requirements for model exchange.
A PMML document can store multiple, named models, thus allowing one to choose a specific model, or else use all enclosed models (such as in the case of ensemble models). Furthermore since a given model type can be used for multiple tasks (e.g., a neural network model can be used for both regression and classification) PMML defines an attribute, functionName, which identifies the nature of the model. Its values can be:
- associationRules
- sequences
- classification
- regression
- clustering
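As an illustration of how a consumer might use these attributes, the sketch below indexes the models in a document by their functionName. The fragment is abridged and hypothetical: each model is reduced to its identifying attributes and would not pass schema validation as written. The helper name `models_by_function` is also an assumption.

```python
import xml.etree.ElementTree as ET

# Abridged, hypothetical fragment: the models carry only their identifying
# attributes and omit the content a valid PMML document would require.
DOC = """\
<PMML version="2.0">
  <Header/>
  <RegressionModel modelName="solubility" functionName="regression"/>
  <TreeModel modelName="toxicity" functionName="classification"/>
</PMML>
"""


def models_by_function(xml_text: str) -> dict:
    """Group model names by the value of their functionName attribute."""
    root = ET.fromstring(xml_text)
    index = {}
    for element in root:
        function = element.get("functionName")
        if function is not None:  # skips non-model children such as Header
            index.setdefault(function, []).append(element.get("modelName"))
    return index


index = models_by_function(DOC)
```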
The PMML specification currently supports a number of specific model types listed below.
- Association Rules
- General Regression
- Naive Bayes
- Neural Network
- Rulesets (similar to a flattened decision tree)
- Tree models
- Support Vector Models
PMML defines a number of data types including integers, reals, probabilities (a real between 0 and 1) and percentages (a real between 0 and 100). Support for arrays (both mixed and pure) and matrices is also included.
An important aspect of model exchange is a precise definition of the fields. Fields are defined by their names as well as their types (ordinal, categorical, continuous). Furthermore, there are two possible sources of fields in a PMML document. One class of fields is the fields that are directly input to the model (termed MiningField). The second class of fields is derived fields. That is, PMML allows one to include a field which is derived from one of the first type of fields. The derivation might be a simple transformation (say, converting to a log scale) or something more involved. Furthermore, PMML itself defines a number of simple transformations such as discretization, aggregation and so on.
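The distinction between input fields and derived fields can be sketched in a few lines. This is not PMML's own transformation syntax; the field names, bin edges and helper functions below are assumptions chosen to illustrate a log-scale conversion and a simple discretization.

```python
import bisect
import math
from typing import Dict, List, Union

Value = Union[float, str]


def discretize(x: float, edges: List[float], labels: List[str]) -> str:
    """Map a continuous value to a bin label; len(labels) == len(edges) + 1."""
    return labels[bisect.bisect_right(edges, x)]


def with_derived_fields(record: Dict[str, float]) -> Dict[str, Value]:
    """Return the input fields plus two hypothetical derived fields."""
    out: Dict[str, Value] = dict(record)
    out["logX1"] = math.log10(record["X1"])  # simple transformation
    out["X2_bin"] = discretize(record["X2"], [1.0, 5.0], ["low", "medium", "high"])
    return out


row = with_derived_fields({"X1": 100.0, "X2": 3.0})
```

Because the derivation is recorded alongside the model, a consumer only has to supply the raw input fields; the derived fields can be recomputed locally.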
Two important components of a PMML document are the DataDictionary and MiningSchema. The former defines the various fields to be referenced in specific models. Thus, the DataDictionary can be used as a global data store. The MiningSchema, on the other hand, is specific to a single model and defines the fields that are required for the model to be used. Furthermore, information regarding the nature of the fields (e.g., predicted, independent variable) and the importance of each field can also be included. In addition, information about minimum and maximum values can be stored, which can be used to mark outliers or as a simplistic means of assessing domain applicability.
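The use of stored minimum and maximum values as a crude applicability check might look like the sketch below. The ranges are hypothetical values of the kind a consumer could read from the field definitions; the function name is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical ranges, as they might be read from a model's field definitions.
FIELD_RANGES: Dict[str, Tuple[float, float]] = {
    "X1": (0.0, 10.0),
    "X2": (-5.0, 5.0),
}


def fields_out_of_range(
    record: Dict[str, float], ranges: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Return the names of fields whose values lie outside the stored range."""
    return [
        name
        for name, (low, high) in ranges.items()
        if not low <= record[name] <= high
    ]


flagged = fields_out_of_range({"X1": 3.0, "X2": 7.5}, FIELD_RANGES)
```

A non-empty result suggests that the observation lies outside the region covered by the training data, so the prediction should be treated with caution.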
As was noted above, PMML does not appear to be able to include arbitrary namespaces. However, the specification does provide what are known as Extension elements. These are sub-elements of any PMML entity that allow one to extend that entity with more functionality. One of the examples is to specify the format of a certain field to be of the form %3.2f. Clearly, when extension fragments are included, their meaning must be well defined (by producers and consumers of such documents). However, PMML itself makes use of extension entities to do a variety of things.
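A consumer that understands such a format extension could apply it as sketched below. The fragment assumes that the extension is carried as name and value attributes on an Extension sub-element; the extender value and the helper name `display_format` are hypothetical, and a consumer that does not recognize the extension would simply ignore it.

```python
import xml.etree.ElementTree as ET

# Assumed layout: the extension is a name/value pair on an Extension element.
FRAGMENT = """\
<DataField name="X1" optype="continuous">
  <Extension extender="example.org" name="format" value="%3.2f"/>
</DataField>
"""


def display_format(field: ET.Element, default: str = "%s") -> str:
    """Return the format string from a 'format' extension, if one is present."""
    for extension in field.findall("Extension"):
        if extension.get("name") == "format":
            return extension.get("value")
    return default


field = ET.fromstring(FRAGMENT)
formatted = display_format(field) % 3.14159
```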
A PMML model can also contain a set of univariate statistics which can be used to describe individual fields.
As noted above a PMML document can have multiple models. These may be accessed individually, but can also be used in an ensemble framework via the notion of model composition. Model verification is also handled by PMML. This is in contrast to document validity which is just a check that the XML is valid. The aim of model verification is to include a subset of the training data, such that when the model is regenerated by a consumer of the document, the results of the model can be checked with those stored in the document.
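Model verification, as described above, amounts to comparing regenerated predictions with stored ones. The sketch below assumes the verification records have already been read from the document into (inputs, expected output) pairs; the function names, the tolerance and the toy model are illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


def failed_verifications(
    predict: Callable[[Row], float],
    records: List[Tuple[Row, float]],
    rel_tol: float = 1e-9,
) -> List[int]:
    """Return indices of records whose regenerated prediction disagrees."""
    return [
        i
        for i, (inputs, expected) in enumerate(records)
        if not math.isclose(predict(inputs), expected, rel_tol=rel_tol)
    ]


def rebuilt_model(row: Row) -> float:
    """Toy stand-in for a model regenerated by the consumer."""
    return 1.0 + 2.0 * row["X1"]


# The last expected value is deliberately wrong to show a failed check.
RECORDS = [({"X1": 0.0}, 1.0), ({"X1": 2.0}, 5.0), ({"X1": 3.0}, 7.5)]
failures = failed_verifications(rebuilt_model, RECORDS)
```

An empty list means the consumer has reproduced the producer's model to within the tolerance; any index in the list points to a record where the two implementations disagree.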
Product Support
PMML is supported by a number of commercial products such as Clementine (http://www.spss.com/clementine/) and CART. More details can be found here.
In terms of open-source support, a package is available for R that implements a subset of PMML. Currently the package has support for linear regression, decision tree and random forest survival models. Work is in progress to include support for more model types. Furthermore, it appears that the package currently focuses on model export and does not allow importing of models from a PMML document.
PMML Example
I provide a minimal example of a PMML document representing a linear regression model:
<?xml version="1.0" encoding="UTF-8"?>
<PMML version="2.0">
  <Header>
  </Header>
  <DataDictionary numberOfFields="3">
    <DataField name="X1" optype="continuous"/>
    <DataField name="X2" optype="continuous"/>
    <DataField name="Y" optype="continuous"/>
  </DataDictionary>
  <RegressionModel modelName="MyModel" functionName="regression" modelType="linearRegression" targetFieldName="Y">
    <MiningSchema>
      <MiningField name="X1" usageType="active"/>
      <MiningField name="X2" usageType="active"/>
      <MiningField name="Y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="5.363738346373">
      <NumericPredictor name="X1" exponent="1" coefficient="43.5336464636353"/>
      <NumericPredictor name="X2" exponent="1" coefficient="12.5563837646463"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
The model has two independent variables and a single dependent variable. Given the coefficients for the independent variables and the intercept, one can easily rebuild the original OLS model in any environment. However, in addition to rebuilding the model (and then using it to make new predictions), one can transform the PMML document to some other format using XSL. An example of such a transformation can be viewed here.
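As a sketch of what "rebuilding the model" means for a consumer, the following Python code parses the regression part of the example document and returns a prediction function. It is a minimal illustration rather than a general PMML reader: it assumes a single RegressionModel with one RegressionTable, no XML namespace declaration (as in the example), and only NumericPredictor terms. The function name `load_regression` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# Regression part of the example document (Header and DataDictionary omitted).
PMML_TEXT = """\
<PMML version="2.0">
  <RegressionModel modelName="MyModel" functionName="regression"
                   modelType="linearRegression" targetFieldName="Y">
    <RegressionTable intercept="5.363738346373">
      <NumericPredictor name="X1" exponent="1" coefficient="43.5336464636353"/>
      <NumericPredictor name="X2" exponent="1" coefficient="12.5563837646463"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_regression(pmml_text: str) -> Callable[[Dict[str, float]], float]:
    """Rebuild a prediction function from the stored intercept and coefficients."""
    root = ET.fromstring(pmml_text)
    table = root.find("RegressionModel").find("RegressionTable")
    intercept = float(table.get("intercept"))
    terms = [
        # A missing exponent is treated as 1 in this sketch.
        (p.get("name"), float(p.get("coefficient")), int(p.get("exponent", "1")))
        for p in table.findall("NumericPredictor")
    ]

    def predict(row: Dict[str, float]) -> float:
        return intercept + sum(coef * row[name] ** exp for name, coef, exp in terms)

    return predict


predict = load_regression(PMML_TEXT)
y_origin = predict({"X1": 0.0, "X2": 0.0})  # equals the intercept
y_unit_x1 = predict({"X1": 1.0, "X2": 0.0})  # intercept plus the X1 coefficient
```

Documents that declare a PMML namespace on the root element would require namespace-qualified element names in the `find` calls.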
Disadvantages
- PMML apparently does not allow inclusion of namespaces, though this needs further investigation
- The above discussion does not address the issue of descriptors. Currently there is no standardized way to specify which descriptor was used, because variable naming conventions are not standardized. Furthermore, without a standardized way to access descriptors one cannot use a PMML-specified model for new predictions.
Projects
The above description indicates that PMML provides much of what we need for the exchange of predictive statistical models. Some of the areas in which the ECCRs can contribute (if PMML is chosen) are outlined below.
The best way to show that this works is to have the ECCRs build models and then exchange them using PMML. However, this will require some work (unless we work in R and use a decision tree model) before it is generally doable. More specific projects therefore include:
- Improve support for PMML in the various platforms that the ECCRs are using.
- For R users, it would make sense to work with the author of the pmml package. He is happy to accept contributions and is also working on increasing support for more model types (personal communication)
- On a related note, support for import is required in the R package
- Provide functionality on each of the ECCR websites, such that models can be selected and downloaded as PMML documents
