Title: | DEvelopment (of Multi-Linear QSPR/QSAR) MOdels VAlidated using Test Set |
---|---|
Description: | Tool for the development of multi-linear QSPR/QSAR (quantitative structure-property/activity relationship) models. These models are used in chemistry, biology and pharmacy to relate the structure of a molecule to its properties (such as activity or toxicity, but also physical properties). The functions of this package provide: selection of descriptors based on variance, intercorrelation and user expertise; selection of the best multi-linear regression in terms of correlation and robustness; internal validation methods (Leave-One-Out, Leave-Many-Out, Y-scrambling); and external validation using test sets. |
Authors: | Vinca Prana |
Maintainer: | Vinca Prana <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0 |
Built: | 2025-01-22 03:10:26 UTC |
Source: | https://github.com/cran/DEMOVA |
Package: | DEMOVA |
Type: | Package |
Version: | 1.0 |
Date: | 2016-03-15 |
License: | GPL (>= 2) |
Example input files are available in the "tests" folder.
# data <- read.csv("NameOfInputFile.csv", header = TRUE, sep = " ")
# mydesc <- data[, 3:ncol(data)]
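As a minimal sketch of the expected layout (an ID column, a property column, then one column per descriptor), the toy data frame below stands in for a file read with read.csv. All column names and values are invented for illustration.

```r
# Toy data in the layout described above: ID, property, then descriptors.
# Names and values are invented; a real run would use read.csv() on an input file.
data <- data.frame(
  id_molecule = c("molecule1", "molecule2", "molecule3"),
  property    = c(1, 5, 3),
  x1          = c(0.02, 0.06, 0.04),
  x2          = c(500, 600, 550)
)

id     <- data[, 1]                            # observation names
y      <- data[, 2]                            # property / response values
mydesc <- data[, 3:ncol(data), drop = FALSE]   # descriptor columns only
```

Using ncol(data) rather than a stored dimension vector keeps the column range tied to the data frame that is actually being subset.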
Functions should be used in this order:
- preselection
- select_variables
- select_MLR
- fitting
- LOO / LMO / scramb (in any order; these validation steps are optional for the steps that follow)
- prediction
- graphe_3Sets
1. Selassie, C. D. History of Quantitative Structure-Activity Relationship; Burger's Medicinal Chemistry and Drug Discovery Sixth Edition; John Wiley & Sons Inc., 2002; Vol. 1.
2. Willett, P. Chemoinformatics: a History. Wiley Interdisciplinary Reviews: Computational Molecular Science 2011, 1, 46-56.
Perform a multi-linear regression (MLR) between the property and the previously selected descriptors (chosen with the select_MLR function).
Calculate the R2 coefficient and the values predicted by the MLR. Plot experimental values vs predicted values.
fitting(mydata, n, property)
mydata | Dataframe containing names and values of response and descriptors |
n | Number of selected descriptors of the regression (determined using the select_MLR function) |
property | Name of the studied property |
prediction_TrainSet_Y.csv | File containing the predictions obtained from the fit |
Y_TrainingSet.tiff | Image representing experimental values vs predicted values for the training set |
fit | lm object returned by the function |
# First run select_MLR to define n
# y <- data[, 2]
# mydata <- cbind(y, MLR)
# fit <- fitting(mydata, ncol(MLR), "Name of property")
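The regression step described above can be sketched with base R alone. This is not the package's fitting function: the data are simulated, and the object names (fit_sketch, r2, pred) are assumptions made for illustration.

```r
# Simulated training set (arbitrary values, for illustration only).
set.seed(1)
mydata <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
mydata$y <- 2 + 1.5 * mydata$x1 - 0.8 * mydata$x2 + rnorm(30, sd = 0.3)

fit_sketch <- lm(y ~ x1 + x2, data = mydata)   # multi-linear regression
r2         <- summary(fit_sketch)$r.squared    # coefficient of determination
pred       <- fitted(fit_sketch)               # predicted values for the training set
```

For a model with an intercept, r2 equals 1 minus the ratio of the residual sum of squares to the total sum of squares of y.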
Calculate the predicted values for the external validation set and plot experimental values vs predicted values for the training, test and external validation sets.
graphe_3Sets(fit, mydata, mynewdata, mynewdata2, n)
fit | Multi-linear regression between property and selected descriptors (lm object) |
mydata | Dataframe containing names and values of response and descriptors |
mynewdata | Dataframe containing property and selected descriptor values for the test set |
mynewdata2 | Dataframe containing property and selected descriptor values for the external validation set |
n | Number of selected descriptors of the regression (determined using select_MLR) |
Rext, Rext2 | List containing the determination coefficients of the test set and of the external validation set |
Graphe_3sets.tiff | Image representing experimental values vs predicted values for all three sets |
# This function must be run last.
## "Test_set.csv" should have the following form:
## ID property SelectedDesc1 SelectedDesc2 ...
# new_nom <- "Test_set.csv"
# newdata <- read.csv(new_nom, header = TRUE, sep = " ")
# mynewdata <- newdata[, 2:ncol(newdata)]
## "External_set.csv" should have the following form:
## ID property SelectedDesc1 SelectedDesc2 ...
# new_nom2 <- "External_set.csv"
# newdata2 <- read.csv(new_nom2, header = TRUE, sep = " ")
# mynewdata2 <- newdata2[, 2:ncol(newdata2)]
# graphe_3Sets(fit, mydata, mynewdata, mynewdata2, ncol(MLR))
Estimate the robustness of the regression equation using the leave-many-out (LMO) cross-validation method.
LMO(mydata, cv, n)
mydata | Dataframe containing names and values of response and descriptors |
cv | Number of folds |
n | Number of selected descriptors of the regression (determined using select_MLR) |
Returns Q2, the coefficient that measures robustness.
1. Gramatica, P. Principles of QSAR Models Validation: Internal and External. QSAR & Combinatorial Science 2007, 26, 694-701.
2. Golbraikh, A.; Tropsha, A. Beware of Q(2)! Journal of Molecular Graphics & Modelling 2002, 20, 269-276.
# First run select_MLR to define n
# LMO(mydata, 5, ncol(MLR))
# LMO(mydata, 10, ncol(MLR))
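A minimal sketch of leave-many-out cross-validation in base R follows. The helper q2_lmo is hypothetical and is not the package's LMO function. It assumes the response sits in a column named y, treats every other column as a descriptor, and uses one common definition of Q2 (1 minus PRESS over the total sum of squares); the package may compute it differently.

```r
# q2_lmo() is a hypothetical helper, not part of DEMOVA.
# Assumes the response is in column "y" and all other columns are descriptors.
q2_lmo <- function(mydata, cv) {
  folds <- sample(rep(seq_len(cv), length.out = nrow(mydata)))  # random fold labels
  pred  <- numeric(nrow(mydata))
  for (k in seq_len(cv)) {
    held_out <- folds == k
    fit_k <- lm(y ~ ., data = mydata[!held_out, , drop = FALSE])          # refit without fold k
    pred[held_out] <- predict(fit_k, newdata = mydata[held_out, , drop = FALSE])
  }
  press <- sum((mydata$y - pred)^2)              # predictive residual sum of squares
  1 - press / sum((mydata$y - mean(mydata$y))^2)
}

set.seed(42)
toy <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(40, sd = 0.2)
q2_toy <- q2_lmo(toy, cv = 5)
```

Because the fold assignment is random, repeated calls give slightly different Q2 values unless the seed is fixed.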
Estimate the robustness of the regression equation using the leave-one-out (LOO) cross-validation method.
LOO(mydata, n)
mydata | Dataframe containing names and values of response and descriptors |
n | Number of selected descriptors of the regression (determined using select_MLR) |
Returns Q2, the coefficient that measures robustness.
1. Gramatica, P. Principles of QSAR Models Validation: Internal and External. QSAR & Combinatorial Science 2007, 26, 694-701.
2. Golbraikh, A.; Tropsha, A. Beware of Q(2)! Journal of Molecular Graphics & Modelling 2002, 20, 269-276.
# First run select_MLR to define n
# LOO(mydata, ncol(MLR))
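For an ordinary least-squares fit, the leave-one-out prediction errors can be obtained without refitting, by dividing each residual by one minus its leverage. The sketch below uses that identity. The helper q2_loo is hypothetical, not the package's LOO function, and the Q2 definition (1 minus PRESS over the total sum of squares) is an assumption about the convention.

```r
# q2_loo() is a hypothetical helper, not part of DEMOVA.
q2_loo <- function(fit) {
  y <- model.response(model.frame(fit))
  loo_resid <- residuals(fit) / (1 - hatvalues(fit))   # leave-one-out prediction errors
  1 - sum(loo_resid^2) / sum((y - mean(y))^2)
}

set.seed(7)
toy <- data.frame(x1 = rnorm(25), x2 = rnorm(25))
toy$y <- 0.5 + toy$x1 + 2 * toy$x2 + rnorm(25, sd = 0.3)
fit_sketch <- lm(y ~ x1 + x2, data = toy)
q2_toy <- q2_loo(fit_sketch)   # never larger than the R2 of the full fit
```

A Q2 close to R2 suggests that no single observation dominates the fit.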
Calculate the predicted values for the test set and plot experimental values vs predicted values for both the training and test sets. This function also gives the R2 coefficient of the test set.
prediction(fit, mydata, mynewdata, n)
fit | Multi-linear regression between property and selected descriptors |
mydata | Dataframe containing names and values of response and descriptors |
mynewdata | Dataframe containing property and selected descriptor values for the test set |
n | Number of selected descriptors of the regression (determined using select_MLR) |
Exp.vs.Pred.tiff | Image representing experimental values vs predicted values for both sets |
Rext | Determination coefficient of the test set |
# This function must be run after the model has been chosen.
## "Test_set.csv" should have the following form:
## ID property SelectedDesc1 SelectedDesc2 ...
# new_nom <- "Test_set.csv"
# newdata <- read.csv(new_nom, header = TRUE, sep = " ")
# mynewdata <- newdata[, 2:ncol(newdata)]
# prediction(fit, mydata, mynewdata, ncol(MLR))
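The external validation step can be sketched in base R as follows. This is not the package's prediction function: the data are simulated, and the test-set R2 is computed here as the squared Pearson correlation between experimental and predicted values, which is one common convention and only an assumption about what the package reports.

```r
set.seed(3)
train <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
train$y <- 1 + train$x1 - 2 * train$x2 + rnorm(30, sd = 0.3)
test <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
test$y <- 1 + test$x1 - 2 * test$x2 + rnorm(10, sd = 0.3)

fit_sketch <- lm(y ~ x1 + x2, data = train)          # model built on the training set only
pred_test  <- predict(fit_sketch, newdata = test)    # predictions for unseen molecules

# Assumed convention: squared correlation of experimental vs predicted values.
r2_test <- cor(test$y, pred_test)^2
```

The test set must contain the same descriptor columns, with the same names, as the training set used to build the model.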
Remove descriptors that have missing values or a variance lower than 0.001.
preselection(desc)
desc | Dataframe containing the names of descriptors and their values |
Returns a dataframe without the removed variables.
## The input file should have the following form:
## id_molecule property x1 x2 x3 ...   # header line
## molecule1 1 0.02 500 ...
## molecule2 5 0.06 600 ...
# nom <- "NameOfInputFile.csv"
# data <- read.csv(nom, header = TRUE, sep = " ")
# dim <- dim(data)
# mydesc <- data[, 3:dim[2]]
# id <- data[, 1]
# y <- data[, 2]
# d <- preselection(mydesc)
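The filtering rule described above can be sketched in base R. The function preselect_sketch is a hypothetical stand-in, not the package's preselection, and the min_var argument is introduced here only to make the 0.001 threshold explicit.

```r
# preselect_sketch() is a hypothetical stand-in, not the package's preselection().
preselect_sketch <- function(desc, min_var = 0.001) {
  has_na  <- vapply(desc, anyNA, logical(1))
  low_var <- vapply(desc, function(x) var(x, na.rm = TRUE) < min_var, logical(1))
  desc[, !(has_na | low_var), drop = FALSE]
}

desc <- data.frame(
  x1 = c(0.2, 0.6, 0.4, 0.9),   # kept
  x2 = c(500, NA, 550, 610),    # dropped: contains a missing value
  x3 = c(1, 1, 1, 1),           # dropped: zero variance
  x4 = c(3.1, 2.7, 4.0, 3.5)    # kept
)
kept <- names(preselect_sketch(desc))   # "x1" "x4"
```

Note that a fixed variance threshold depends on the scale of each descriptor, so descriptors expressed in very small units can be removed even when they vary meaningfully.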
Perform the y-scrambling method, which consists of permuting the y values and trying to develop new models from them. These models must perform poorly for the original model to be considered valid. A graph of R2 vs r(y, yrandom) is created.
scramb(mydata, k, n, cercle = FALSE)
mydata | Dataframe containing names and values of response and descriptors |
k | Number of random runs |
n | Number of selected descriptors of the regression (determined using select_MLR) |
cercle | TRUE or FALSE (default). If TRUE, a circle is drawn around the point representing the original model |
Returns a list of:
mean | Mean of the R^2 values of the new models |
sd | Standard deviation of the R^2 values of the new models |
The following files are also produced:
Scramb.tiff | Image of the graph R2 vs r(y, yrandom) |
Scramb.csv | Description of 'comp2' |
Tropsha, A.; Gramatica, P.; Gombar, V. K. The Importance of Being Earnest: Validation Is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR & Combinatorial Science 2003, 22, 69-77.
Rucker, C.; Rucker, G.; Meringer, M. y-Randomization and Its Variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345-2357.
Lindgren, F.; Hansen, B.; Karcher, W.; Sjostrom, M.; Eriksson, L. Model Validation by Permutation Tests: Applications to Variable Selection. Journal of Chemometrics 1996, 10, 521-532.
# First run select_MLR to define n
# scramb(mydata, 1000, ncol(MLR))
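A minimal y-scrambling sketch in base R is shown below. It is not the package's scramb function: the data are simulated, and the object names (toy, scrambled, r2_original) are assumptions for illustration. Each run permutes the response, refits the model, and records the R2 of the scrambled model together with the correlation between the original and permuted responses.

```r
set.seed(11)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(30, sd = 0.3)

r2_original <- summary(lm(y ~ x1 + x2, data = toy))$r.squared

k <- 200   # number of random runs
scrambled <- t(replicate(k, {
  shuffled   <- toy
  shuffled$y <- sample(toy$y)                                   # permute the response
  c(r2  = summary(lm(y ~ x1 + x2, data = shuffled))$r.squared,  # R2 of the scrambled model
    r_y = cor(toy$y, shuffled$y))                               # r(y, yrandom)
}))

mean_r2 <- mean(scrambled[, "r2"])   # expected to be far below r2_original
sd_r2   <- sd(scrambled[, "r2"])
```

If the scrambled models reached R2 values comparable to the original one, the original correlation would likely be due to chance.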
From a list of descriptors and response values, this function selects the best model as the best compromise between correlation and robustness.
select_MLR(y, desc, n, method = "forward")
y | Vector with values of the property/response |
desc | Dataframe containing the names of descriptors and their values |
n | Maximum number (integer) of descriptors for the final equation |
method | Method used to build the regression. Can be "backward", "forward" (default) or "seqrep". For more information, see the leaps package. |
Returns the list of selected variables for the chosen MLR.
# First run preselection and select_variables so that desc contains no missing, constant or intercorrelated descriptors.
# MLR <- select_MLR(y, desc, 5)
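To illustrate the idea of forward selection, the sketch below greedily adds the descriptor that most increases R2 until n descriptors are chosen. This is a simplified base-R stand-in: forward_select is a hypothetical helper, it does not use the leaps package, and it ranks models by R2 only, whereas the function described above also takes robustness into account.

```r
# forward_select() is a hypothetical helper, not the package's select_MLR().
forward_select <- function(y, desc, n) {
  selected <- character(0)
  while (length(selected) < min(n, ncol(desc))) {
    candidates <- setdiff(names(desc), selected)
    r2 <- vapply(candidates, function(v) {
      # R2 of the model using the already selected descriptors plus candidate v
      summary(lm(y ~ ., data = desc[, c(selected, v), drop = FALSE]))$r.squared
    }, numeric(1))
    selected <- c(selected, candidates[which.max(r2)])   # keep the best candidate
  }
  selected
}

set.seed(5)
desc <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40), x4 = rnorm(40))
y <- 3 * desc$x2 - 2 * desc$x4 + rnorm(40, sd = 0.2)
chosen <- forward_select(y, desc, n = 2)
```

In this simulated example only x2 and x4 carry signal, so a sensible selection procedure should recover them.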
For each pair of descriptors whose intercorrelation coefficient is higher than ThresholdInterCor, this function lets the user choose which descriptor to keep. The selection can also be made automatically, based on the correlation of each descriptor with the property.
select_variables(id, y, d, ThresholdInterCor, auto = FALSE)
id | List of the names of observations |
y | List of the values of the property/response |
d | Dataframe containing the names of descriptors and their values (without missing or constant values) |
ThresholdInterCor | Threshold value (double) of the accepted intercorrelation between descriptors (should be between 0 and 1) |
auto | TRUE or FALSE (default). The selection of descriptors is done automatically, based on the correlation between descriptor and property (auto=TRUE), or manually by the user (auto=FALSE) |
Returns a dataframe containing only non-intercorrelated variables.
# Run after preselection: d <- preselection(desc)
# desc <- select_variables(id, y, d, 0.95)
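The automatic mode can be sketched in base R as follows. The helper drop_intercorrelated is hypothetical and is not the package's select_variables. It assumes that absolute Pearson correlations are compared against the threshold and that, within each intercorrelated pair, the descriptor less correlated with the property is discarded.

```r
# drop_intercorrelated() is a hypothetical helper, not the package's select_variables().
drop_intercorrelated <- function(y, d, threshold) {
  keep <- names(d)
  cors <- abs(cor(d))                       # pairwise descriptor intercorrelations
  for (i in seq_len(ncol(d) - 1)) {
    for (j in (i + 1):ncol(d)) {
      a <- names(d)[i]
      b <- names(d)[j]
      if (a %in% keep && b %in% keep && cors[i, j] > threshold) {
        # discard whichever descriptor correlates less with the property
        loser <- if (abs(cor(d[[a]], y)) >= abs(cor(d[[b]], y))) b else a
        keep  <- setdiff(keep, loser)
      }
    }
  }
  d[, keep, drop = FALSE]
}

set.seed(9)
x1 <- rnorm(50)
d  <- data.frame(x1 = x1, x2 = x1 + rnorm(50, sd = 0.01), x3 = rnorm(50))
y  <- 2 * d$x1 + d$x3 + rnorm(50, sd = 0.1)
reduced <- drop_intercorrelated(y, d, 0.95)   # keeps x3 and one of x1 / x2
```

The manual mode differs only in who makes the choice within each pair: the user decides instead of the correlation with the property.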