[r-cran-caret] 02/05: Add description of data files

Balint Reczey rbalint at moszumanska.debian.org
Mon Feb 15 20:23:38 UTC 2016


This is an automated email from the git hooks/post-receive script.

rbalint pushed a commit to branch master
in repository r-cran-caret.

commit 18783f0b915263068d279b8776612e83869e8753
Author: Balint Reczey <balint at balintreczey.hu>
Date:   Sat Dec 12 11:33:10 2015 +0100

    Add description of data files
---
 debian/README.source | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 173 insertions(+)

diff --git a/debian/README.source b/debian/README.source
new file mode 100644
index 0000000..fd7f5d0
--- /dev/null
+++ b/debian/README.source
@@ -0,0 +1,173 @@
+Explanation for binary files inside source package according to
+  http://lists.debian.org/debian-devel/2013/09/msg00332.html
+
+This package contains sample data files for experimenting with
+the implemented algorithms.
+
+The individual data files are described below:
+
+Files: data/*
+Documentation: man/GenABEL.data-package.Rd
+  GenABEL.data contains six files with data which is used by examples of GenABEL.
+  These are ge03d2.clean.RData, ge03d2c.RData, ge03d2ex.clean.RData, ge03d2ex.RData, ge03d2.RData and srdta.RData.
+
+Files: data/ge03d2.RData
+Documentation: man/ge03d2.Rd
+        A small data set (approximately 1,000 people and 8,000 SNPs) containing
+        data on 3 autosomes and the X chromosome. It is a good set for
+        demonstrating the QC procedures (different genotyping errors
+        are introduced) and GWA analysis.
+        This data set was developed for the "Advances in population-
+
+Files: data/BloodBrain.RData
+Documentation: man/BloodBrain.Rd
+     Mente and Lombardo (2005) develop models to predict the log of the
+     ratio of the concentration of a compound in the brain and the
+     concentration in blood. For each compound, they computed three
+     sets of molecular descriptors: MOE 2D, rule-of-five and Charge
+     Polar Surface Area (CPSA). In all, 134 descriptors were
+     calculated. Included in this package are 208 non-proprietary
+     literature compounds. The vector ‘logBBB’ contains the
+     concentration ratio and the data frame ‘bbbDescr’ contains the
+     descriptor values.
+
+Files: data/cars.RData
+Documentation: man/cars.Rd
+     Kuiper (2008) collected Kelley Blue Book resale data for
+     804 GM cars (2005 model year).
+
+Files: data/cox2.RData
+Documentation: man/cox2.Rd
+     From Sutherland, O'Brien, and Weaver (2003): "A set of 467
+     cyclooxygenase-2 (COX-2) inhibitors has been assembled from the
+     published work of a single research group, with in vitro
+     activities against human recombinant enzyme expressed as IC50
+     values ranging from 1 nM to >100 uM (53 compounds have
+     indeterminate IC50 values)."
+
+     The data are in the Supplemental Data file for the article.
+
+     A set of 255 descriptors (MOE2D and QikProp) was generated. To
+     classify the data, we used a cutoff of $2^{2.5}$ to determine
+     activity.
+
+Files: data/dhfr.RData
+Documentation: man/dhfr.Rd
+     Sutherland and Weaver (2004) discuss QSAR models for dihydrofolate
+     reductase (DHFR) inhibition. This data set contains values for 325
+     compounds. For each compound, 228 molecular descriptors have been
+     calculated. Additionally, each sample is designated as "active"
+     or "inactive".
+
+     The data frame ‘dhfr’ contains a column called ‘Y’ with the
+     outcome classification. The remainder of the columns are molecular
+     descriptor values.
+
+Files: data/GermanCredit.RData
+Documentation: man/GermanCredit.Rd
+     Data from Dr. Hans Hofmann of the University of Hamburg.
+
+     These data have two classes for the credit worthiness: good or
+     bad. There are predictors related to attributes, such as: checking
+     account status, duration, credit history, purpose of the loan,
+     amount of the loan, savings accounts or bonds, employment
+     duration, installment rate in percentage of disposable income,
+     personal information, other debtors/guarantors, residence
+     duration, property, age, other installment plans, housing, number
+     of existing credits, job information, number of people being
+     liable to provide maintenance for, telephone, and foreign worker
+     status.
+
+     Many of these predictors are discrete and have been expanded into
+     several 0/1 indicator variables.
+
+Files: data/mdrr.RData
+Documentation: man/mdrr.Rd
+     Svetnik et al. (2003) describe these data: "Bakken and Jurs
+     studied a set of compounds originally discussed by Klopman et al.,
+     who were interested in multidrug resistance reversal (MDRR)
+     agents. The original response variable is a ratio measuring the
+     ability of a compound to reverse a leukemia cell's resistance to
+     adriamycin. However, the problem was treated as a classification
+     problem, and compounds with the ratio >4.2 were considered active,
+     and those with the ratio <= 2.0 were considered inactive.
+     Compounds with the ratio between these two cutoffs were called
+     moderate and removed from the data for two-class classification,
+     leaving a set of 528 compounds (298 actives and 230 inactives).
+     (Various other arrangements of these data were examined by Bakken
+     and Jurs, but we will focus on this particular one.) We did not
+     have access to the original descriptors, but we generated a set of
+     342 descriptors of three different types that should be similar to
+     the original descriptors, using the DRAGON software."
+
+     The data and R code are in the Supplemental Data file for the
+     article.
+
+
+Files: data/oil.RData
+Documentation: man/oil.Rd
+     Fatty acid concentrations of commercial oils were measured using
+     gas chromatography.  The data is used to predict the type of oil.
+     Note that only the known oils are in the data set. Also, the
+     authors state that there are 95 samples of known oils. However, we
+     count 96 in Table 1 (pp. 33-35).
+
+Files: data/pottery.RData
+Documentation: man/pottery.Rd
+     Measurements of 58 pottery samples.
+Source:
+     R. G. Brereton (2003). _Chemometrics: Data Analysis for the
+     Laboratory and Chemical Plant_, p. 261.
+
+Files: data/segmentationData.RData
+Documentation: man/segmentationData.Rd
+     Hill, LaPan, Li and Haney (2007) develop models to predict which
+     cells in a high content screen were well segmented.  The data
+     consists of 119 imaging measurements on 2019 cells. The original
+     analysis used 1009 for training and 1010 as a test set (see the
+     column called ‘Case’).
+
+     The outcome class is contained in a factor variable called ‘Class’
+     with levels "PS" for poorly segmented and "WS" for well segmented.
+
+     The raw data used in the paper can be found at the BioMed Central
+     website. Versions of caret < 4.98 contained the original data. The
+     version now contained in ‘segmentationData’ is modified. First,
+     several discrete versions of some of the predictors (with the
+     suffix "Status") were removed. Second, there are several skewed
+     predictors with minimum values of zero (that would benefit from
+     some transformation, such as the log). A constant value of 1 was
+     added to these fields: ‘AvgIntenCh2’, ‘FiberAlign2Ch3’,
+     ‘FiberAlign2Ch4’, ‘SpotFiberCountCh4’ and ‘TotalIntenCh2’.
+
+     A binary version of the original data is at <URL:
+     http://topepo.github.io/caret/segmentationOriginal.RData>.
+
+Files: data/tecator.RData
+Documentation: man/tecator.Rd
+     "These data are recorded on a Tecator Infratec Food and Feed
+     Analyzer working in the wavelength range 850 - 1050 nm by the Near
+     Infrared Transmission (NIT) principle. Each sample contains finely
+     chopped pure meat with different moisture, fat and protein
+     contents.
+
+     If results from these data are used in a publication we want you
+     to mention the instrument and company name (Tecator) in the
+     publication.  In addition, please send a preprint of your article
+     to
+
+     Karin Thente, Tecator AB, Box 70, S-263 21 Hoganas, Sweden
+
+     The data are available in the public domain with no responsibility
+     from the original data source. The data can be redistributed as
+     long as this permission note is attached."
+
+     "For each meat sample the data consists of a 100 channel spectrum
+     of absorbances and the contents of moisture (water), fat and
+     protein.  The absorbance is -log10 of the transmittance measured
+     by the spectrometer. The three contents, measured in percent, are
+     determined by analytic chemistry."
+
+     Included here are the training, monitoring and test sets.
+
+ -- Balint Reczey <balint at balintreczey.hu>, Sat, 12 Dec 2015 11:18:34 +0100

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/r-cran-caret.git


