[r-cran-caret] 02/05: Add description of data files
Balint Reczey
rbalint at moszumanska.debian.org
Mon Feb 15 20:23:38 UTC 2016
This is an automated email from the git hooks/post-receive script.
rbalint pushed a commit to branch master
in repository r-cran-caret.
commit 18783f0b915263068d279b8776612e83869e8753
Author: Balint Reczey <balint at balintreczey.hu>
Date: Sat Dec 12 11:33:10 2015 +0100
Add description of data files
---
debian/README.source | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 173 insertions(+)
diff --git a/debian/README.source b/debian/README.source
new file mode 100644
index 0000000..fd7f5d0
--- /dev/null
+++ b/debian/README.source
@@ -0,0 +1,173 @@
+Explanation for binary files inside source package according to
+ http://lists.debian.org/debian-devel/2013/09/msg00332.html
+
+This package contains sample data files for experimenting with
+the implemented algorithms.
+
+The individual data files are described below:
+
+Files: data/*
+Documentation: man/GenABEL.data-package.Rd
+ GenABEL.data contains six files with data which is used by examples of GenABEL.
+ These are ge03d2.clean.RData, ge03d2c.RData, ge03d2ex.clean.RData, ge03d2ex.RData, ge03d2.RData and srdta.RData.
+
+Files: data/ge03d2.Rdata
+Documentation: man/ge03d2.Rd
+ A small data set (approximately 1,000 people and 8,000 SNPs) containing
+ data on 3 autosomes and the X chromosome. It is a good set for
+ demonstration of the QC procedures (different genotyping errors
+ are introduced) and GWA analysis.
+ This data set was developed for the "Advances in population-
+
+Files: data/BloodBrain.RData
+Documentation: man/BloodBrain.Rd
+ Mente and Lombardo (2005) develop models to predict the log of the
+ ratio of the concentration of a compound in the brain and the
+ concentration in blood. For each compound, they computed three
+ sets of molecular descriptors: MOE 2D, rule-of-five and Charge
+ Polar Surface Area (CPSA). In all, 134 descriptors were
+ calculated. Included in this package are 208 non-proprietary
+ literature compounds. The vector ‘logBBB’ contains the
+ concentration ratio and the data frame ‘bbbDescr’ contains the
+ descriptor values.
+
+Files: data/cars.RData
+Documentation: man/cars.Rd
+ Kuiper (2008) collected Kelley Blue Book resale data for
+ 804 GM cars (2005 model year).
+
+Files: data/cox2.RData
+Documentation: man/cox2.Rd
+ From Sutherland, O'Brien, and Weaver (2003): "A set of 467
+ cyclooxygenase-2 (COX-2) inhibitors has been assembled from the
+ published work of a single research group, with in vitro
+ activities against human recombinant enzyme expressed as IC50
+ values ranging from 1 nM to >100 uM (53 compounds have
+ indeterminate IC50 values)."
+
+ The data are in the Supplemental Data file for the article.
+
+ A set of 255 descriptors (MOE2D and QikProp) were generated. To
+ classify the data, we used a cutoff of $2^{2.5}$ to determine
+ activity.
+
+Files: data/dhfr.RData
+Documentation: man/dhfr.Rd
+ Sutherland and Weaver (2004) discuss QSAR models for dihydrofolate
+ reductase (DHFR) inhibition. This data set contains values for 325
+ compounds. For each compound, 228 molecular descriptors have been
+ calculated. Additionally, each sample is designated as "active"
+ or "inactive".
+
+ The data frame ‘dhfr’ contains a column called ‘Y’ with the
+ outcome classification. The remainder of the columns are molecular
+ descriptor values.
+
+Files: data/GermanCredit.RData
+Documentation: man/GermanCredit.Rd
+ Data from Dr. Hans Hofmann of the University of Hamburg.
+
+ These data have two classes for the credit worthiness: good or
+ bad. There are predictors related to attributes, such as: checking
+ account status, duration, credit history, purpose of the loan,
+ amount of the loan, savings accounts or bonds, employment
+ duration, installment rate in percentage of disposable income,
+ personal information, other debtors/guarantors, residence
+ duration, property, age, other installment plans, housing, number
+ of existing credits, job information, number of people being
+ liable to provide maintenance for, telephone, and foreign worker
+ status.
+
+ Many of these predictors are discrete and have been expanded into
+ several 0/1 indicator variables.
+
+Files: data/mdrr.RData
+Documentation: man/mdrr.Rd
+ Svetnik et al. (2003) describe these data: "Bakken and Jurs
+ studied a set of compounds originally discussed by Klopman et al.,
+ who were interested in multidrug resistance reversal (MDRR)
+ agents. The original response variable is a ratio measuring the
+ ability of a compound to reverse a leukemia cell's resistance to
+ adriamycin. However, the problem was treated as a classification
+ problem, and compounds with the ratio >4.2 were considered active,
+ and those with the ratio <= 2.0 were considered inactive.
+ Compounds with the ratio between these two cutoffs were called
+ moderate and removed from the data for two-class classification,
+ leaving a set of 528 compounds (298 actives and 230 inactives).
+ (Various other arrangements of these data were examined by Bakken
+ and Jurs, but we will focus on this particular one.) We did not
+ have access to the original descriptors, but we generated a set of
+ 342 descriptors of three different types that should be similar to
+ the original descriptors, using the DRAGON software."
+
+ The data and R code are in the Supplemental Data file for the
+ article.
+
+
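As a rough illustration of the labelling rule quoted in the mdrr entry above, the following Python sketch maps a reversal ratio to a class and drops the "moderate" compounds. The function and constant names are hypothetical and are not part of caret or the original study; only the two cutoffs (4.2 and 2.0) come from the quoted text.

```python
from typing import List, Optional, Tuple

# Cutoffs taken from the quoted description; the names are illustrative only.
ACTIVE_ABOVE = 4.2       # ratio > 4.2   -> "active"
INACTIVE_AT_MOST = 2.0   # ratio <= 2.0  -> "inactive"


def mdrr_label(ratio: float) -> Optional[str]:
    """Return "active", "inactive", or None for a moderate compound."""
    if ratio > ACTIVE_ABOVE:
        return "active"
    if ratio <= INACTIVE_AT_MOST:
        return "inactive"
    return None  # between the cutoffs: moderate, excluded from the two-class set


def two_class_subset(ratios: List[float]) -> List[Tuple[float, str]]:
    """Keep only compounds that receive a class label."""
    kept = []
    for ratio in ratios:
        label = mdrr_label(ratio)
        if label is not None:
            kept.append((ratio, label))
    return kept
```

Note that a ratio of exactly 4.2 falls in the moderate band under this reading, because the quoted rule uses a strict inequality for the active class.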
+Files: data/oil.RData
+Documentation: man/oil.Rd
+ Fatty acid concentrations of commercial oils were measured using
+ gas chromatography. The data is used to predict the type of oil.
+ Note that only the known oils are in the data set. Also, the
+ authors state that there are 95 samples of known oils. However, we
+ count 96 in Table 1 (pgs. 33-35).
+
+Files: data/pottery.RData
+Documentation: man/pottery.Rd
+ Measurements of 58 pottery samples.
+Source:
+ R. G. Brereton (2003). _Chemometrics: Data Analysis for the
+ Laboratory and Chemical Plant_, pg. 261.
+
+Files: data/segmentationData.RData
+Documentation: man/segmentationData.Rd
+ Hill, LaPan, Li and Haney (2007) develop models to predict which
+ cells in a high content screen were well segmented. The data
+ consists of 119 imaging measurements on 2019 cells. The original
+ analysis used 1009 for training and 1010 as a test set (see the
+ column called ‘Case’).
+
+ The outcome class is contained in a factor variable called ‘Class’
+ with levels "PS" for poorly segmented and "WS" for well segmented.
+
+ The raw data used in the paper can be found at the BioMed Central
+ website. Versions of caret < 4.98 contained the original data. The
+ version now contained in ‘segmentationData’ is modified. First,
+ several discrete versions of some of the predictors (with the
+ suffix "Status") were removed. Second, there are several skewed
+ predictors with minimum values of zero (that would benefit from
+ some transformation, such as the log). A constant value of 1 was
+ added to these fields: ‘AvgIntenCh2’, ‘FiberAlign2Ch3’,
+ ‘FiberAlign2Ch4’, ‘SpotFiberCountCh4’ and ‘TotalIntenCh2’.
+
+ A binary version of the original data is at <URL:
+ http://topepo.github.io/caret/segmentationOriginal.RData>.
+
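The segmentationData entry above says a constant of 1 was added to skewed predictors whose minimum is zero so that a transformation such as the log becomes usable. A minimal Python sketch of why the shift matters follows; the helper name `shift_then_log` is hypothetical, and the entry does not say that a log is actually applied to the stored values, only that the shifted fields would benefit from one.

```python
import math
from typing import Iterable, List


def shift_then_log(values: Iterable[float], shift: float = 1.0) -> List[float]:
    """Add a constant to each value, then take the natural log.

    With shift=1.0 a raw value of 0 maps to log(1) = 0 instead of
    raising an error, which is the motivation described in the entry.
    """
    return [math.log(v + shift) for v in values]


def log_is_defined(value: float) -> bool:
    """Report whether math.log accepts the value without shifting."""
    try:
        math.log(value)
        return True
    except ValueError:
        return False
```

Without the shift, `math.log(0.0)` raises `ValueError`, so any predictor containing zeros could not be log-transformed directly.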
+Files: data/tecator.RData
+Documentation: man/tecator.Rd
+ "These data are recorded on a Tecator Infratec Food and Feed
+ Analyzer working in the wavelength range 850 - 1050 nm by the Near
+ Infrared Transmission (NIT) principle. Each sample contains finely
+ chopped pure meat with different moisture, fat and protein
+ contents.
+
+ If results from these data are used in a publication we want you
+ to mention the instrument and company name (Tecator) in the
+ publication. In addition, please send a preprint of your article
+ to
+
+ Karin Thente, Tecator AB, Box 70, S-263 21 Hoganas, Sweden
+
+ The data are available in the public domain with no responsibility
+ from the original data source. The data can be redistributed as
+ long as this permission note is attached."
+
+ "For each meat sample the data consists of a 100 channel spectrum
+ of absorbances and the contents of moisture (water), fat and
+ protein. The absorbance is -log10 of the transmittance measured
+ by the spectrometer. The three contents, measured in percent, are
+ determined by analytic chemistry."
+
+ Included here are the training, monitoring and test sets.
+
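The tecator entry above defines absorbance as -log10 of the transmittance. The relationship can be sketched in Python as below; this assumes transmittance is expressed as a fraction (not a percentage), which the entry does not state, and the function names are illustrative only.

```python
import math


def absorbance(transmittance: float) -> float:
    """Absorbance A = -log10(T), with T assumed to be a positive fraction."""
    if transmittance <= 0.0:
        raise ValueError("transmittance must be positive")
    return -math.log10(transmittance)


def transmittance(absorbance_value: float) -> float:
    """Inverse relation: T = 10 ** (-A)."""
    return 10.0 ** (-absorbance_value)
```

Under this assumption, a sample that transmits 1% of the incident light has an absorbance of 2, and each additional unit of absorbance corresponds to a tenfold drop in transmitted light.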
+ -- Balint Reczey <balint at balintreczey.hu>, Sat, 12 Dec 2015 11:18:34 +0100
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/r-cran-caret.git