Package 'DEMOVA' reference manual

Title:	DEvelopment (of Multi-Linear QSPR/QSAR) MOdels VAlidated using Test Set
Description:	Tool for the development of multi-linear QSPR/QSAR models (Quantitative structure-property/activity relationship). Theses models are used in chemistry, biology and pharmacy to find a relationship between the structure of a molecule and its property (such as activity, toxicology but also physical properties). The various functions of this package allows: selection of descriptors based of variances, intercorrelation and user expertise; selection of the best multi-linear regression in terms of correlation and robustness; methods of internal validation (Leave-One-Out, Leave-Many-Out, Y-scrambling) and external using test sets.
Authors:	Vinca Prana
Maintainer:	Vinca Prana <[email protected]>
License:	GPL (>= 2)
Version:	1.0
Built:	2025-03-23 03:53:00 UTC
Source:	https://github.com/cran/DEMOVA

DEvelopment of (multi-linear QSPR/QSAR) MOdels VAlidated using test set.

Description

Tool for the development of multi-linear QSPR/QSAR models (Quantitative structure-property/activity relationship). Theses models are used in chemistry, biology and pharmacy to find a relationship between the structure of a molecule and its property (such as activity, toxicology but also physical properties). The various functions of this package allows: selection of descriptors based of variances, intercorrelation and user expertise; selection of the best multi-linear regression in terms of correlation and robustness; methods of internal validation (Leave-One-Out, Leave-Many-Out, Y-scrambling) and external using test sets.

Details

Package:	DEMOVA
Type:	Package
Version:	1.0
Date:	2016-03-15
License:	GPL (>= 2)

Example of input files are avaible into the floder "tests".
# data<-read.csv("NameOfInputFile.csv",header = TRUE , sep=" ")
# mydesc<-data[,3:dim[2]]
Functions should be use in this order:
- preselection
- select_variables
- select_MLR
- fit
- LOO / LMO / Scramb (No specific order between these ones. Optional to do the rest)
- prediction
- graphe_3Sets

Author(s)

Vinca Prana
Maintainer: Vinca Prana <[email protected]>

References

1. Selassie, C. D. History of Quantitative Structure-Activity Relationship; Burger's Medicinal Chemistry and Drug Discovery Sixth Edition; John Wiley & Sons Inc., 2002; Vol. 1. (2)
2. Willett, P. Chemoinformatics: a History. Wiley Interdisciplinary Reviews: Computational Molecular Science 2011, 1, 46-56.

Performance of selected model

Description

Perform a multi linear regression between property and previously selected descriptors (using select_MLR function).
Calculate R2 coefficient and the predicted values from the MLR. Trace the graph experimental values vs predicted values.

Usage

fitting(mydata, n, property)
fitting(mydata, n, property)

Arguments

`mydata`	Dataframe containing names and values of response and descriptors
`n`	Number of selected descriptors of the regression (determined using select_MLR function)
`property`	Name of the studied proterty

Value

`prediction_TrainSet_Y.csv`	File containing prediction obtained using the fitting
`Y_TrainingSet.tiff`	Image representing experimental values vs predicted values for the training set
`fit`	lm object return by the function

Examples

# First run select_MLR to define n
# y<-data[,2]
# mydata<-cbind(y,MLR)
# fit<-fitting(data,dim(MLR)[2],"Name of property")
# First run select_MLR to define n
# y<-data[,2]
# mydata<-cbind(y,MLR)
# fit<-fitting(data,dim(MLR)[2],"Name of property")

Predictions for the external validation set and graph

Description

Calulate the predicted values for the external validation set and trace the graph experimental values vs predicted values for training, test and external validation sets.

Usage

graphe_3Sets(fit, mydata, mynewdata, mynewdata2, n)
graphe_3Sets(fit, mydata, mynewdata, mynewdata2, n)

Arguments

`fit`	Multi linear regression between property and selected descriptors (lm object)
`mydata`	Dataframe containing names and values of response and descriptors
`mynewdata`	Dataframe containing property and selected descriptors values for the test set
`mynewdata2`	Dataframe containing property and selected descriptors values for the external validation set
`n`	Numbers of selected descriptors of the regression (determined using select_MLR)

Value

`Rext`, `Rext2`	return a list containing the value of the determination coefficient of the test set and of the external validation set
`Graphe_3sets.tiff`	Image representing experimental values vs predicted values for the all three sets

Examples

# This function have to be run last!

## "Test_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

# new_nom<-'Test_set.csv'
# newdata<-read.csv(new_nom,header=TRUE , sep=" ")
# mynewdata=newdata[,2:dim[2]]


## "External_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

# new_nom2<-'External_set.csv'
# newdata2<-read.csv(new_nom2,header=TRUE , sep=" ")
# mynewdata2=newdata2[,2:dim[2]]

#graphe_3Sets(fit,mynewdata,mynewdata2,dim(MLR)[2])
# This function have to be run last!

## "Test_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

# new_nom<-'Test_set.csv'
# newdata<-read.csv(new_nom,header=TRUE , sep=" ")
# mynewdata=newdata[,2:dim[2]]


## "External_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

# new_nom2<-'External_set.csv'
# newdata2<-read.csv(new_nom2,header=TRUE , sep=" ")
# mynewdata2=newdata2[,2:dim[2]]

#graphe_3Sets(fit,mynewdata,mynewdata2,dim(MLR)[2])

Leave Many Out

Description

Calculate the robustness of the equation using the leave many out method.

Usage

LMO(mydata, cv, n)
LMO(mydata, cv, n)

Arguments

`mydata`	Dataframe containing names and values of response and descriptors
`cv`	Numbers of fold
`n`	Numbers of selected descriptors of the regression (determined using Select_MLR)

Value

return Q2, the coefficient that measure the robstness

References

1. Gramatica, P. Principles of QSAR Models Validation: Internal and External. Qsar & Combinatorial Science 2007, 26, 694-701.
2. Golbraikh, A.; Tropsha, A. Beware of Q(2)! Journal of Molecular Graphics & Modelling 2002, 20, 269-276.

Examples

# First run Select_MLR to define n

#LMO(mydata,5,dim(MLR)[2])
#LMO(mydata,10,dim(MLR)[2])
# First run Select_MLR to define n

#LMO(mydata,5,dim(MLR)[2])
#LMO(mydata,10,dim(MLR)[2])

Leave One Out

Description

Calculate the robustness of the equation using the leave one out method.

Usage

LOO(mydata, n)
LOO(mydata, n)

Arguments

`mydata`	Dataframe containing names and values of response and descriptors
`n`	Numbers of selected descriptors of the regression (determined using Select_MLR)

Value

return Q2, the coefficient that measure the robstness

References

Examples

# First run Select_MLR to define n

# LOO(mydata,dim(MLR)[2])
# First run Select_MLR to define n

# LOO(mydata,dim(MLR)[2])

Predictions for the test set and graph

Description

Calulate the predicted values for the test set and trace the graph experimental values vs predicted values for both training and test sets. This function also give the R2 test coefficent.

Usage

prediction(fit, mydata, mynewdata, n)
prediction(fit, mydata, mynewdata, n)

Arguments

`fit`	Multi linear regression between property and selected descriptors
`mydata`	Dataframe containing names and values of response and descriptors
`mynewdata`	Dataframe containing property and selected descriptors values for the test set
`n`	Numbers of selected descriptors of the regression (determined using Select_MLR)

Value

`Exp.vs.Pred.tiff`	Image representing experimental values vs predicted values for the both sets
`Rext`	return the value of the determination coefficient of the test set

Examples

# This function have to be run after choise of the model.

## "Test_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

#new_nom<-'Test_set.csv'
#newdata<-read.csv(new_nom,header=TRUE , sep=" ")
#mynewdata=newdata[,2:dim[2]]

#prediction(fit,mynewdata,dim(MLR)[2])
# This function have to be run after choise of the model.

## "Test_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

#new_nom<-'Test_set.csv'
#newdata<-read.csv(new_nom,header=TRUE , sep=" ")
#mynewdata=newdata[,2:dim[2]]

#prediction(fit,mynewdata,dim(MLR)[2])

Suppression of missing or constant descriptors

Description

Remove descriptors with missing values and a variance lower than 0.001.

Usage

preselection(desc)
preselection(desc)

Arguments

desc

Dataframe containing the names of desciptors and their values

Value

return a dataframe without the removed variables

Examples

## The input file should be with the following form
## id_molecule propriete x1 x2 x3 ... # Header line
## molecule1 1 0.02 500 ...
## molecule2 5 0.06 600 ...

# nom<-"NameOfInputFile.csv"
# data<-read.csv(nom,header = TRUE , sep=" ") 
# dim<-dim(data)
# mydesc<-data[,3:dim[2]]
# id<-data[,1]
# y<-data[,2] 

# d<-preselection(mydesc)

## The input file should be with the following form
## id_molecule propriete x1 x2 x3 ... # Header line
## molecule1 1 0.02 500 ...
## molecule2 5 0.06 600 ...

# nom<-"NameOfInputFile.csv"
# data<-read.csv(nom,header = TRUE , sep=" ") 
# dim<-dim(data)
# mydesc<-data[,3:dim[2]]
# id<-data[,1]
# y<-data[,2] 

# d<-preselection(mydesc)

scrambling

Description

Perform the y-scrambling method that consit to permute y values and try to develop new models. They have to be unperformants in order to validate the original one. The graph R2 vs r(y,yrandom) is created.

Usage

scramb(mydata, k, n, cercle = FALSE)
scramb(mydata, k, n, cercle = FALSE)

Arguments

`mydata`	Dataframe containing names and values of response and descriptors
`k`	Number of random run
`n`	Number of selected descriptors of the regression (determined using Select_MLR)
`cercle`	Value is TRUE or FALSE (by default) . If it TRUE it's draw a circle around the point representinf the original model

Value

Return a list of

`mean`	Mean of R^2 new model
`sd`	RStandard deviation of R^2 new model

And also

`Scramb.tiff`	Description of 'comp1'
`Scramb.csv`	Description of 'comp2'

References

Tropsha, A.; Gramatica, P.; Gombar, V. K. The Importance of Being Earnest: Validation Is the Absolute Essential for Successful Application and Interpretation of QSPR Models. Qsar \& Combinatorial Science 2003, 22, 69-77.
Rucker, C.; Rucker, G.; Meringer, M. y-Randomization and Its Variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345-2357.
Lindgren, F.; Hansen, B.; Karcher, W.; Sjostrom, M.; Eriksson, L. Model Validation by Permutation Tests: Applications to Variable Selection. Journal of Chemometrics 1996, 10, 521-532.

Examples

# First run Select_MLR to define n

# scramb(mydata,1000,nom,dim(MLR)[2])
# First run Select_MLR to define n

# scramb(mydata,1000,nom,dim(MLR)[2])

Development of the model (multi linear regression)

Description

From a list of descriptors and responses values, this function choose the best compromise between correlation and robustness to select the best model.

Usage

select_MLR(y, desc, n, method = "forward")
select_MLR(y, desc, n, method = "forward")

Arguments

`y`	Vector with values of the property/response
`desc`	Dataframe containing the names of desciptors and their values
`n`	Maximal number (integer) of desciptors for the final equation
`method`	Determine the method used to build the regression. Can be: "backward", "forward" (by default) or "seqrep". For more info see leaps package.

Value

Return the list of selected variables for the choosen MLR.

Examples

# First run Select_variables to remove descriptors with missing or constant values.
# MLR<-select_MLR(y,desc,5)
# First run Select_variables to remove descriptors with missing or constant values.
# MLR<-select_MLR(y,desc,5)

Selection of descriptors

Description

This function allow the user to select wanted descriptors between both that are intercorrelated with a correlation coefficent higher that ThresholdInterCor. The selection can also be automatic based on the correlation with the property of each variables.

Usage

select_variables(id, y, d, ThresholdInterCor, auto = FALSE)
select_variables(id, y, d, ThresholdInterCor, auto = FALSE)

Arguments

`id`	List of the names of observations
`y`	List of the values of the property/response
`d`	Dataframe containing the names of desciptors and their values (without missing or constant values)
`ThresholdInterCor`	Threshold value (double) of the accepted intercorrelation between descriptors (should be between 0 and 1)
`auto`	Two possible values: TRUE or FALSE (by default). The selection of descriptors is done automatically based on the correlation between descriptor and property (auto=TRUE) or is done manually by user (auto=FALSE)

Value

return a dataframe containing only of non intercorrelated variables

Examples

# Run after Preselection : d<-Preselection(desc)

# desc<-select_variables(id,y,d,0.95)

# Run after Preselection : d<-Preselection(desc)

# desc<-select_variables(id,y,d,0.95)

Package 'DEMOVA'

Help Index

DEvelopment of (multi-linear QSPR/QSAR) MOdels VAlidated using test set.

Description

Details

Author(s)

References

Performance of selected model

Description

Usage

Arguments

Value

Examples

Predictions for the external validation set and graph

Description

Usage

Arguments

Value

Examples

Leave Many Out

Description

Usage

Arguments

Value

References

Examples

Leave One Out

Description

Usage

Arguments

Value

References

Examples

Predictions for the test set and graph

Description

Usage

Arguments

Value

Examples

Suppression of missing or constant descriptors

Description

Usage

Arguments

Value

Examples

scrambling

Description

Usage

Arguments

Value

References

Examples

Development of the model (multi linear regression)

Description

Usage

Arguments

Value

Examples

Selection of descriptors

Description

Usage

Arguments

Value

Examples