[r-cran-gower] 01/02: New upstream version 0.1.2

Andreas Tille tille at debian.org
Mon Oct 23 09:29:12 UTC 2017


This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository r-cran-gower.

commit cf0a4fdd1c2f5430680f8ca8e8c5798b0391ef41
Author: Andreas Tille <tille at debian.org>
Date:   Mon Oct 23 11:28:38 2017 +0200

    New upstream version 0.1.2
---
 DESCRIPTION                 |  23 ++
 MD5                         |  18 ++
 NAMESPACE                   |   5 +
 NEWS                        |  12 +
 R/gower-pkg.R               |  20 ++
 R/gower.R                   | 146 +++++++++
 build/vignette.rds          | Bin 0 -> 218 bytes
 inst/doc/intro.R            |  21 ++
 inst/doc/intro.Rmd          |  78 +++++
 inst/doc/intro.html         | 158 ++++++++++
 man/gower-package.Rd        |   9 +
 man/gower_dist.Rd           |  60 ++++
 man/gower_topn.Rd           |  49 +++
 src/Makevars                |   4 +
 src/R_register_native.c     |  30 ++
 src/gower.c                 | 722 ++++++++++++++++++++++++++++++++++++++++++++
 tests/testthat.R            |   3 +
 tests/testthat/test_gower.R | 118 ++++++++
 vignettes/intro.Rmd         |  78 +++++
 19 files changed, 1554 insertions(+)

diff --git a/DESCRIPTION b/DESCRIPTION
new file mode 100644
index 0000000..1030463
--- /dev/null
+++ b/DESCRIPTION
@@ -0,0 +1,23 @@
+Package: gower
+Maintainer: Mark van der Loo <mark.vanderloo at gmail.com>
+License: GPL-3
+Title: Gower's Distance
+LazyData: no
+Type: Package
+LazyLoad: yes
+Authors at R: c( person("Mark", "van der Loo", role=c("aut","cre"),email="mark.vanderloo at gmail.com"))
+Description: Compute Gower's distance (or similarity) coefficient between records. Compute 
+    the top-n matches between records. Core algorithms are executed in parallel on systems
+    supporting OpenMP.
+Version: 0.1.2
+URL: https://github.com/markvanderloo/gower
+BugReports: https://github.com/markvanderloo/gower/issues
+Date: 2017-02-23
+Suggests: testthat, knitr, rmarkdown
+RoxygenNote: 6.0.1
+VignetteBuilder: knitr
+NeedsCompilation: yes
+Packaged: 2017-02-23 20:13:17 UTC; mark
+Author: Mark van der Loo [aut, cre]
+Repository: CRAN
+Date/Publication: 2017-02-23 23:35:17
diff --git a/MD5 b/MD5
new file mode 100644
index 0000000..9338453
--- /dev/null
+++ b/MD5
@@ -0,0 +1,18 @@
+feb1cf5c6d28497cf375369bfa5c0cab *DESCRIPTION
+83404f04f998736f089f40b74968e0a3 *NAMESPACE
+d2cd60900c18a09051290f7cb994771e *NEWS
+2a15d8f245c2f6d3cd49f98d1c760bbd *R/gower-pkg.R
+6152fcc7e8ce26080d8e7a10d70cc5bb *R/gower.R
+00948546dfa81f53c4bebf838eb2d98a *build/vignette.rds
+dedc916a69d5cf53c8285208f4684caa *inst/doc/intro.R
+060e532599d0221cfb9e3be9fa0329ae *inst/doc/intro.Rmd
+7da7d4be933ba19e74db91a4a03db746 *inst/doc/intro.html
+68b9193cbb24435b2ecb773b97b23ccb *man/gower-package.Rd
+47011f697c87c9a551f9adbfb126f7b3 *man/gower_dist.Rd
+8f4ea66daefdb202361ce25902917980 *man/gower_topn.Rd
+4e95da77024f1ca60e63ce4aabf063c1 *src/Makevars
+8acb5e89b5fccf4ee507995a333b5572 *src/R_register_native.c
+2b99d922d1401ac4c1ebe96c2c93aa4d *src/gower.c
+3009b35ff0db11976f27dba9963141f3 *tests/testthat.R
+7fcab2ed91229b68f9a721450cb328d8 *tests/testthat/test_gower.R
+060e532599d0221cfb9e3be9fa0329ae *vignettes/intro.Rmd
diff --git a/NAMESPACE b/NAMESPACE
new file mode 100644
index 0000000..cba2749
--- /dev/null
+++ b/NAMESPACE
@@ -0,0 +1,5 @@
+# Generated by roxygen2: do not edit by hand
+
+export(gower_dist)
+export(gower_topn)
+useDynLib(gower, .registration=TRUE)
diff --git a/NEWS b/NEWS
new file mode 100644
index 0000000..50ffde0
--- /dev/null
+++ b/NEWS
@@ -0,0 +1,12 @@
+version 0.1.2
+- fixed valgrind warning
+- registered native routines, as now recommended by CRAN
+
+version 0.1.1
+- gower_dist now issues warning when nr of rows on input do not divide (recycling)
+- Code now depends on gcc version for compatability with Windows + R<=3.2.x
+- Bugfix: bad range calculation caused faulty distance computation
+
+version 0.1.0 
+- First release
+
diff --git a/R/gower-pkg.R b/R/gower-pkg.R
new file mode 100644
index 0000000..fb04463
--- /dev/null
+++ b/R/gower-pkg.R
@@ -0,0 +1,20 @@
+#' Gower's distance/similarity measure.
+#'
+#' A C-based implementation of Gower's distance.
+#' 
+#' @name gower-package
+#' @docType package
+#' @useDynLib gower, .registration=TRUE
+#' 
+{}
+
+
+.onLoad <- function(libname, pkgname){
+  max_threads <- 1L
+  max_threads <- .Call("R_get_max_threads")
+  thread_limit <- .Call("R_get_thread_limit")
+  max_threads <- min(max_threads, thread_limit)
+  # leave one core for the user to control the machine.
+  if (max_threads > 2) max_threads <- max_threads - 1L
+  options(gd_num_thread = as.integer(max_threads))
+}
diff --git a/R/gower.R b/R/gower.R
new file mode 100644
index 0000000..7721bff
--- /dev/null
+++ b/R/gower.R
@@ -0,0 +1,146 @@
+
+
+
+#' Gower's distance
+#'
+#' Compute Gower's distance, pairwise between records in two data sets \code{x} 
+#' and \code{y}. Records from the smallest data set are recycled over.
+#' 
+#' @section Details:
+#' There are three ways to specify which columns of \code{x} should be compared
+#' with what columns of \code{y}. The first option is do give no specification. 
+#' In that case columns with matching names will be used. The second option
+#' is to use only the \code{pairs_y} argument, specifying for each column in \code{x}
+#' in order, which column in \code{y} must be used to pair it with (use \code{0} 
+#' to skip a column in \code{x}). The third option is to explicitly specify the
+#' columns to be matched using \code{pair_x} and \code{pair_y}.
+#' 
+#' 
+#' @section Note:
+#' Gower (1971) originally defined a similarity measure (\eqn{s}, say)
+#' with values ranging from 0 (completely dissimilar) to 1 (completely similar).
+#' The distance returned here equals \eqn{1-s}.
+#'
+#'
+#' @param x \code{[data.frame]}
+#' @param y \code{[data.frame]}
+#' @param pair_x \code{[numeric|character] (optional)} Columns in \code{x} used for comparison. 
+#'    See Details below.
+#' @param pair_y \code{[numeric|character] (optional)} Columns in \code{y} used for comparison. 
+#'    See Details below.
+#' @param eps \code{[numeric] (optional)} Computed numbers (variable ranges) 
+#'    smaller than \code{eps} are treated as zero. 
+#' @param nthread Number of threads to use for parallelization. By default,
+#'    for a dual-core machine, 2 threads are used. For any other machine 
+#'    n-1 cores are used so your machine doesn't freeze during a big computation. 
+#'    The maximum nr of threads are determined from \code{omp::get_max_threads}.
+#' 
+#' 
+#' @return
+#'   A \code{numeric} vector of length \code{max(nrow(x),nrow(y))}.
+#' 
+#' @seealso \code{\link{gower_topn}}
+#' 
+#' @references 
+#' Gower, John C. "A general coefficient of similarity and some of its 
+#' properties." Biometrics (1971): 857-871.
+#' 
+#' @export
+gower_dist <- function(x, y, pair_x=NULL, pair_y=NULL, eps = 1e-8
+                       ,nthread=getOption("gd_num_thread")){
+  check_recycling(nrow(x),nrow(y))
+  check_recycling(nrow(x),nrow(y))
+  check_recycling(nrow(x),nrow(y))
+  check_recycling(nrow(x),nrow(y))
+  gower_work(x=x,y=y,pair_x=pair_x,pair_y=pair_y,n=NULL,eps=eps,nthread=nthread)
+}
+
+#' Find the top-n matches
+#' 
+#' @description
+#' 
+#' Find the top-n matches in \code{y} for each record in \code{x}.
+#' 
+#' @inheritParams gower_dist
+#' @param n The top-n indices and distances to return.
+#' 
+#' @seealso \code{\link{gower_dist}}
+#' 
+#' @return 
+#'  A \code{list} with two array elements: \code{index}
+#'  and \code{distance}. Both have size \code{n X nrow(x)}. Each ith column 
+#'  corresponds to the top-n best matches of \code{x} with rows in \code{y}.
+#' 
+#' @examples 
+#' # find the top 4 best matches in the iris data set with itself.
+#' x <- iris[1:3,]
+#' lookup <- iris[1:10,]
+#' gower_topn(x=x,y=lookup,n=4)
+#' 
+#' 
+#' @export
+gower_topn <- function(x, y, pair_x=NULL, pair_y = NULL, n=5, eps=1e-8
+                       , nthread=getOption("gd_num_thread")){
+  gower_work(x=x,y=y,pair_x=pair_x,pair_y=pair_y,n=n,eps=eps,nthread)
+}
+
+
+
+gower_work <- function(x, y, pair_x, pair_y, n, eps, nthread){
+  stopifnot(is.numeric(eps), eps>0) 
+  
+  if (is.null(pair_x) & is.null(pair_y)){
+    pair <- match(names(x),names(y),nomatch = 0L)
+  } else if (is.null(pair_x)){
+    pair <- pair_y
+  } else {
+    if (is.character(pair_x) & is.character(pair_y)){
+      m <- match(names(x),pair_x,nomatch=0)
+      pair_x <- pair_x[m]
+      pair_y <- pair_y[m]
+    }
+    pair <- numeric(ncol(x))
+    pair[pair_x] <- pair_y
+  }
+  
+  nthread <- as.integer(nthread)
+  ranges <- numeric(length(pair))
+  for ( i in seq_along(pair)){
+    if (pair[i] == 0 ) next
+    ranges[i] <- .Call("R_get_xy_range",x[[i]],y[[pair[i]]],nthread)
+  }
+
+  factor_pair <- as.integer(sapply(x,is.factor))
+  eps <- as.double(eps)
+  # translate to C-indices (base-0).
+  pair <- as.integer(pair-1L)
+  if (is.null(n)){
+    .Call("R_gower", x, y , ranges, pair, factor_pair, eps, nthread)
+    
+  } else {
+    L <- .Call("R_gower_topn", x, y, ranges, pair, factor_pair, as.integer(n), eps, nthread)
+    names(L) <- c("index","distance")
+    dim(L$index) <- c(n,nrow(x))
+    dim(L$distance) <- dim(L$index)
+    dimnames(L$index) <- list(topn=NULL,row=NULL)
+    dimnames(L$distance) <- dimnames(L$index)
+    L
+  }
+    
+}
+
+RECYCLEWARNING <- tryCatch(1:3+1:2,warning=function(e) e$message)
+
+check_recycling <- function(nx,ny){
+  mx <- max(nx,ny)
+  mn <- min(nx,ny)
+  if ((mx %% mn) != 0) warning(RECYCLEWARNING, call.=FALSE)
+}
+
+
+
+
+
+
+
+
diff --git a/build/vignette.rds b/build/vignette.rds
new file mode 100644
index 0000000..bbd0751
Binary files /dev/null and b/build/vignette.rds differ
diff --git a/inst/doc/intro.R b/inst/doc/intro.R
new file mode 100644
index 0000000..af596a1
--- /dev/null
+++ b/inst/doc/intro.R
@@ -0,0 +1,21 @@
+## ------------------------------------------------------------------------
+library(gower)
+dat1 <- iris[1:10,]
+dat2 <- iris[6:15,]
+gower_dist(dat1, dat2)
+
+## ------------------------------------------------------------------------
+gower_dist(iris[1,], dat1)
+
+## ------------------------------------------------------------------------
+dat1 <- dat2 <- iris[1:10,]
+names(dat2) <- tolower(names(dat2))
+gower_dist(dat1, dat2)
+# tell gower_dist to match columns 1..5 in dat1 with column 1..5 in dat2
+gower_dist(dat1, dat2, pair_y=1:5)
+
+## ------------------------------------------------------------------------
+dat1 <- iris[1:10,]
+L <- gower_topn(x=dat1, y=iris, n=3)
+L
+
diff --git a/inst/doc/intro.Rmd b/inst/doc/intro.Rmd
new file mode 100644
index 0000000..beeb0ae
--- /dev/null
+++ b/inst/doc/intro.Rmd
@@ -0,0 +1,78 @@
+---
+title: "Introduction to the gower package"
+author: "Mark van der Loo"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Introduction to the gower package}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+## Gower's distance measure
+
+Gower's distance can be used to measure how different two records are. The records may contain combinations of logical, numerical, categorical or text data. The distance is always a number between 0 (identical) and 1 
+(maximally dissimilar). An easy to read specification of the measure is given in the original paper.
+
+Gower (1971) [A general coefficient of similarity and some of its properties.](http://venus.unive.it/romanaz/modstat_ba/gowdis.pdf)  _Biometrics_ **27** 857-874.
+
+In short, Gower's distance (or similarity) first computes distances between pairs of variables over two data sets and then combines those distances to a single value per record-pair.
+
+This package modifies Gower's original similarity measure in the following ways.
+
+- In stead of the original similarity _S_, the package returns the distance _1-S_.
+- The original paper does not mention the concept of `NA`. Missing variables are skipped when computing the distance.
+- The original paper does not mention character data. These are treated as categorical data.
+
+## Computing Gower's distance
+
+The function ```gower_dist``` computes pairwise-distances between records.
+
+```{r}
+library(gower)
+dat1 <- iris[1:10,]
+dat2 <- iris[6:15,]
+gower_dist(dat1, dat2)
+```
+If one data frame has less records than the other, the shortest one is recycled over (just like when you're adding two vectors of unequal length)
+
+```{r}
+gower_dist(iris[1,], dat1)
+```
+
+It is possible to control how columns from the two data sets are paired for comparison using the `pair_x` and `pair_y` arguments. This comes in handy when similar colums have different names accross datasets.  By default, columns with matching names are paired. The behaviour is somewhat similar to that of base R's `merge` in that respect.
+
+```{r}
+dat1 <- dat2 <- iris[1:10,]
+names(dat2) <- tolower(names(dat2))
+gower_dist(dat1, dat2)
+# tell gower_dist to match columns 1..5 in dat1 with column 1..5 in dat2
+gower_dist(dat1, dat2, pair_y=1:5)
+```
+
+## Computing the top-n matches
+
+The function `gower_topn` returns a list with two arrays.
+```{r}
+dat1 <- iris[1:10,]
+L <- gower_topn(x=dat1, y=iris, n=3)
+L
+```
+
+The first array is called `index`. Each column corresponds to one row of `x`. The entries of each column
+index the top _n_ best matches of that row in x with rows in `y`. In this example, the best match of the first row of `dat1` is record number ```r L$index[1,1]``` from `iris` (this should be obvious, since they are the same record). The second best match is record number ```r L$index[2,1]``` from `iris`.
+ 
+The second array is called `distance` and it contains the corresponding distances.
+
+
+
+## Parallelization, memory usage
+
+
+The underlying algorithm is implemented in C and parallelized using [OpenMP](http://www.openmp.org). OpenMP is available on most systems that can run R. Please see [this section](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support) of the writing R extensions manual for up-to-details details on which systems are supported. At the time of writing (summer 2016), OSX is the only system not supporting OpenMP out of the box. You can still make it work by installing the [...]
+
+If OpenMP is not supported, the package will still work but the core algorithms will not be parallelized.
+
+This implementation makes no copies of the data in memory. When computing `gower_dist`, two double precision
+arrays of size _max(nrow(x),nrow(y))_ are kept in memory to store intermediate results. When computing the top-n matches, for _k_ cores, _k+2_ double precision arrays of length ```nrow(y)``` are created to store intermediate results at C level.
+
diff --git a/inst/doc/intro.html b/inst/doc/intro.html
new file mode 100644
index 0000000..73e8d75
--- /dev/null
+++ b/inst/doc/intro.html
@@ -0,0 +1,158 @@
+<!DOCTYPE html>
+
+<html xmlns="http://www.w3.org/1999/xhtml">
+
+<head>
+
+<meta charset="utf-8">
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+<meta name="generator" content="pandoc" />
+
+<meta name="viewport" content="width=device-width, initial-scale=1">
+
+<meta name="author" content="Mark van der Loo" />
+
+<meta name="date" content="2017-02-23" />
+
+<title>Introduction to the gower package</title>
+
+
+
+<style type="text/css">code{white-space: pre;}</style>
+<style type="text/css">
+div.sourceCode { overflow-x: auto; }
+table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
+  margin: 0; padding: 0; vertical-align: baseline; border: none; }
+table.sourceCode { width: 100%; line-height: 100%; }
+td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
+td.sourceCode { padding-left: 5px; }
+code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
+code > span.dt { color: #902000; } /* DataType */
+code > span.dv { color: #40a070; } /* DecVal */
+code > span.bn { color: #40a070; } /* BaseN */
+code > span.fl { color: #40a070; } /* Float */
+code > span.ch { color: #4070a0; } /* Char */
+code > span.st { color: #4070a0; } /* String */
+code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
+code > span.ot { color: #007020; } /* Other */
+code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
+code > span.fu { color: #06287e; } /* Function */
+code > span.er { color: #ff0000; font-weight: bold; } /* Error */
+code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
+code > span.cn { color: #880000; } /* Constant */
+code > span.sc { color: #4070a0; } /* SpecialChar */
+code > span.vs { color: #4070a0; } /* VerbatimString */
+code > span.ss { color: #bb6688; } /* SpecialString */
+code > span.im { } /* Import */
+code > span.va { color: #19177c; } /* Variable */
+code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
+code > span.op { color: #666666; } /* Operator */
+code > span.bu { } /* BuiltIn */
+code > span.ex { } /* Extension */
+code > span.pp { color: #bc7a00; } /* Preprocessor */
+code > span.at { color: #7d9029; } /* Attribute */
+code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
+code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
+code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
+code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
+</style>
+
+
+
+<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20bot [...]
+
+</head>
+
+<body>
+
+
+
+
+<h1 class="title toc-ignore">Introduction to the gower package</h1>
+<h4 class="author"><em>Mark van der Loo</em></h4>
+<h4 class="date"><em>2017-02-23</em></h4>
+
+
+
+<div id="gowers-distance-measure" class="section level2">
+<h2>Gower’s distance measure</h2>
+<p>Gower’s distance can be used to measure how different two records are. The records may contain combinations of logical, numerical, categorical or text data. The distance is always a number between 0 (identical) and 1 (maximally dissimilar). An easy to read specification of the measure is given in the original paper.</p>
+<p>Gower (1971) <a href="http://venus.unive.it/romanaz/modstat_ba/gowdis.pdf">A general coefficient of similarity and some of its properties.</a> <em>Biometrics</em> <strong>27</strong> 857-874.</p>
+<p>In short, Gower’s distance (or similarity) first computes distances between pairs of variables over two data sets and then combines those distances to a single value per record-pair.</p>
+<p>This package modifies Gower’s original similarity measure in the following ways.</p>
+<ul>
+<li>In stead of the original similarity <em>S</em>, the package returns the distance <em>1-S</em>.</li>
+<li>The original paper does not mention the concept of <code>NA</code>. Missing variables are skipped when computing the distance.</li>
+<li>The original paper does not mention character data. These are treated as categorical data.</li>
+</ul>
+</div>
+<div id="computing-gowers-distance" class="section level2">
+<h2>Computing Gower’s distance</h2>
+<p>The function <code>gower_dist</code> computes pairwise-distances between records.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(gower)
+dat1 <-<span class="st"> </span>iris[<span class="dv">1</span>:<span class="dv">10</span>,]
+dat2 <-<span class="st"> </span>iris[<span class="dv">6</span>:<span class="dv">15</span>,]
+<span class="kw">gower_dist</span>(dat1, dat2)</code></pre></div>
+<pre><code>##  [1] 0.34606061 0.17939394 0.14303030 0.09636364 0.20424242 0.23636364
+##  [7] 0.16000000 0.19939394 0.19818182 0.45030303</code></pre>
+<p>If one data frame has less records than the other, the shortest one is recycled over (just like when you’re adding two vectors of unequal length)</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">gower_dist</span>(iris[<span class="dv">1</span>,], dat1)</code></pre></div>
+<pre><code>##  [1] 0.0000000 0.1400000 0.1900000 0.2300000 0.0400000 0.4233333 0.1866667
+##  [8] 0.0900000 0.2600000 0.2366667</code></pre>
+<p>It is possible to control how columns from the two data sets are paired for comparison using the <code>pair_x</code> and <code>pair_y</code> arguments. This comes in handy when similar colums have different names accross datasets. By default, columns with matching names are paired. The behaviour is somewhat similar to that of base R’s <code>merge</code> in that respect.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">dat1 <-<span class="st"> </span>dat2 <-<span class="st"> </span>iris[<span class="dv">1</span>:<span class="dv">10</span>,]
+<span class="kw">names</span>(dat2) <-<span class="st"> </span><span class="kw">tolower</span>(<span class="kw">names</span>(dat2))
+<span class="kw">gower_dist</span>(dat1, dat2)</code></pre></div>
+<pre><code>##  [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN</code></pre>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># tell gower_dist to match columns 1..5 in dat1 with column 1..5 in dat2</span>
+<span class="kw">gower_dist</span>(dat1, dat2, <span class="dt">pair_y=</span><span class="dv">1</span>:<span class="dv">5</span>)</code></pre></div>
+<pre><code>##  [1] 0 0 0 0 0 0 0 0 0 0</code></pre>
+</div>
+<div id="computing-the-top-n-matches" class="section level2">
+<h2>Computing the top-n matches</h2>
+<p>The function <code>gower_topn</code> returns a list with two arrays.</p>
+<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">dat1 <-<span class="st"> </span>iris[<span class="dv">1</span>:<span class="dv">10</span>,]
+L <-<span class="st"> </span><span class="kw">gower_topn</span>(<span class="dt">x=</span>dat1, <span class="dt">y=</span>iris, <span class="dt">n=</span><span class="dv">3</span>)
+L</code></pre></div>
+<pre><code>## $index
+##       row
+## topn   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
+##   [1,]    1    2    3    4    5    6    7    8    9    10
+##   [2,]   18   35   48   48   38   17   48   40   39    35
+##   [3,]   28   26   30   31    1   45   12   50   43    31
+## 
+## $distance
+##       row
+## topn          [,1]       [,2]        [,3]       [,4]       [,5]       [,6]
+##   [1,] 0.000000000 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
+##   [2,] 0.008333333 0.01172316 0.008945386 0.01172316 0.01388889 0.01355932
+##   [3,] 0.008945386 0.01233522 0.010169492 0.01450094 0.01388889 0.03177966
+##       row
+## topn         [,7]        [,8]       [,9]       [,10]
+##   [1,] 0.00000000 0.000000000 0.00000000 0.000000000
+##   [2,] 0.02500000 0.005555556 0.01172316 0.008333333
+##   [3,] 0.02622411 0.011723164 0.02838983 0.017278719</code></pre>
+<p>The first array is called <code>index</code>. Each column corresponds to one row of <code>x</code>. The entries of each column index the top <em>n</em> best matches of that row in x with rows in <code>y</code>. In this example, the best match of the first row of <code>dat1</code> is record number <code>1</code> from <code>iris</code> (this should be obvious, since they are the same record). The second best match is record number <code>18</code> from <code>iris</code>.</p>
+<p>The second array is called <code>distance</code> and it contains the corresponding distances.</p>
+</div>
+<div id="parallelization-memory-usage" class="section level2">
+<h2>Parallelization, memory usage</h2>
+<p>The underlying algorithm is implemented in C and parallelized using <a href="http://www.openmp.org">OpenMP</a>. OpenMP is available on most systems that can run R. Please see <a href="https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support">this section</a> of the writing R extensions manual for up-to-details details on which systems are supported. At the time of writing (summer 2016), OSX is the only system not supporting OpenMP out of the box. You can still make  [...]
+<p>If OpenMP is not supported, the package will still work but the core algorithms will not be parallelized.</p>
+<p>This implementation makes no copies of the data in memory. When computing <code>gower_dist</code>, two double precision arrays of size <em>max(nrow(x),nrow(y))</em> are kept in memory to store intermediate results. When computing the top-n matches, for <em>k</em> cores, <em>k+2</em> double precision arrays of length <code>nrow(y)</code> are created to store intermediate results at C level.</p>
+</div>
+
+
+
+<!-- dynamically load mathjax for compatibility with self-contained -->
+<script>
+  (function () {
+    var script = document.createElement("script");
+    script.type = "text/javascript";
+    script.src  = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
+    document.getElementsByTagName("head")[0].appendChild(script);
+  })();
+</script>
+
+</body>
+</html>
diff --git a/man/gower-package.Rd b/man/gower-package.Rd
new file mode 100644
index 0000000..3c65675
--- /dev/null
+++ b/man/gower-package.Rd
@@ -0,0 +1,9 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/gower-pkg.R
+\docType{package}
+\name{gower-package}
+\alias{gower-package}
+\title{Gower's distance/similarity measure.}
+\description{
+A C-based implementation of Gower's distance.
+}
diff --git a/man/gower_dist.Rd b/man/gower_dist.Rd
new file mode 100644
index 0000000..b884a57
--- /dev/null
+++ b/man/gower_dist.Rd
@@ -0,0 +1,60 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/gower.R
+\name{gower_dist}
+\alias{gower_dist}
+\title{Gower's distance}
+\usage{
+gower_dist(x, y, pair_x = NULL, pair_y = NULL, eps = 1e-08,
+  nthread = getOption("gd_num_thread"))
+}
+\arguments{
+\item{x}{\code{[data.frame]}}
+
+\item{y}{\code{[data.frame]}}
+
+\item{pair_x}{\code{[numeric|character] (optional)} Columns in \code{x} used for comparison. 
+See Details below.}
+
+\item{pair_y}{\code{[numeric|character] (optional)} Columns in \code{y} used for comparison. 
+See Details below.}
+
+\item{eps}{\code{[numeric] (optional)} Computed numbers (variable ranges) 
+smaller than \code{eps} are treated as zero.}
+
+\item{nthread}{Number of threads to use for parallelization. By default,
+for a dual-core machine, 2 threads are used. For any other machine 
+n-1 cores are used so your machine doesn't freeze during a big computation. 
+The maximum nr of threads are determined from \code{omp::get_max_threads}.}
+}
+\value{
+A \code{numeric} vector of length \code{max(nrow(x),nrow(y))}.
+}
+\description{
+Compute Gower's distance, pairwise between records in two data sets \code{x} 
+and \code{y}. Records from the smallest data set are recycled over.
+}
+\section{Details}{
+
+There are three ways to specify which columns of \code{x} should be compared
+with what columns of \code{y}. The first option is do give no specification. 
+In that case columns with matching names will be used. The second option
+is to use only the \code{pairs_y} argument, specifying for each column in \code{x}
+in order, which column in \code{y} must be used to pair it with (use \code{0} 
+to skip a column in \code{x}). The third option is to explicitly specify the
+columns to be matched using \code{pair_x} and \code{pair_y}.
+}
+
+\section{Note}{
+
+Gower (1971) originally defined a similarity measure (\eqn{s}, say)
+with values ranging from 0 (completely dissimilar) to 1 (completely similar).
+The distance returned here equals \eqn{1-s}.
+}
+
+\references{
+Gower, John C. "A general coefficient of similarity and some of its 
+properties." Biometrics (1971): 857-871.
+}
+\seealso{
+\code{\link{gower_topn}}
+}
diff --git a/man/gower_topn.Rd b/man/gower_topn.Rd
new file mode 100644
index 0000000..b6b6450
--- /dev/null
+++ b/man/gower_topn.Rd
@@ -0,0 +1,49 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/gower.R
+\name{gower_topn}
+\alias{gower_topn}
+\title{Find the top-n matches}
+\usage{
+gower_topn(x, y, pair_x = NULL, pair_y = NULL, n = 5, eps = 1e-08,
+  nthread = getOption("gd_num_thread"))
+}
+\arguments{
+\item{x}{\code{[data.frame]}}
+
+\item{y}{\code{[data.frame]}}
+
+\item{pair_x}{\code{[numeric|character] (optional)} Columns in \code{x} used for comparison. 
+See Details below.}
+
+\item{pair_y}{\code{[numeric|character] (optional)} Columns in \code{y} used for comparison. 
+See Details below.}
+
+\item{n}{The top-n indices and distances to return.}
+
+\item{eps}{\code{[numeric] (optional)} Computed numbers (variable ranges) 
+smaller than \code{eps} are treated as zero.}
+
+\item{nthread}{Number of threads to use for parallelization. By default,
+for a dual-core machine, 2 threads are used. For any other machine 
+n-1 cores are used so your machine doesn't freeze during a big computation. 
+The maximum nr of threads are determined from \code{omp::get_max_threads}.}
+}
+\value{
+A \code{list} with two array elements: \code{index}
+ and \code{distance}. Both have size \code{n X nrow(x)}. Each ith column 
+ corresponds to the top-n best matches of \code{x} with rows in \code{y}.
+}
+\description{
+Find the top-n matches in \code{y} for each record in \code{x}.
+}
+\examples{
+# find the top 4 best matches in the iris data set with itself.
+x <- iris[1:3,]
+lookup <- iris[1:10,]
+gower_topn(x=x,y=lookup,n=4)
+
+
+}
+\seealso{
+\code{\link{gower_dist}}
+}
diff --git a/src/Makevars b/src/Makevars
new file mode 100644
index 0000000..f4c9aa0
--- /dev/null
+++ b/src/Makevars
@@ -0,0 +1,4 @@
+
+PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
+PKG_LIBS = $(SHLIB_OPENMP_CFLAGS)
+
diff --git a/src/R_register_native.c b/src/R_register_native.c
new file mode 100644
index 0000000..8e8b62e
--- /dev/null
+++ b/src/R_register_native.c
@@ -0,0 +1,30 @@
+#include <R.h>
+#include <Rinternals.h>
+#include <stdlib.h> // for NULL
+#include <R_ext/Rdynload.h>
+
+/* FIXME: 
+   Check these declarations against the C/Fortran source code.
+*/
+
+/* .Call calls */
+extern SEXP R_get_max_threads();
+extern SEXP R_get_thread_limit();
+extern SEXP R_get_xy_range(SEXP, SEXP, SEXP);
+extern SEXP R_gower(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
+extern SEXP R_gower_topn(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
+
+static const R_CallMethodDef CallEntries[] = {
+    {"R_get_max_threads",  (DL_FUNC) &R_get_max_threads,  0},
+    {"R_get_thread_limit", (DL_FUNC) &R_get_thread_limit, 0},
+    {"R_get_xy_range",     (DL_FUNC) &R_get_xy_range,     3},
+    {"R_gower",            (DL_FUNC) &R_gower,            7},
+    {"R_gower_topn",       (DL_FUNC) &R_gower_topn,       8},
+    {NULL, NULL, 0}
+};
+
+void R_init_gower(DllInfo *dll)
+{
+    R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
+    R_useDynamicSymbols(dll, FALSE);
+}
diff --git a/src/gower.c b/src/gower.c
new file mode 100644
index 0000000..64c40ba
--- /dev/null
+++ b/src/gower.c
@@ -0,0 +1,722 @@
+/*  gower - a C/R implementation of Gower's similarity (or distance) measure.
+ *  Copyright (C) 2016  Mark van der Loo
+ *
+ *  This program is free software: you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation, either version 3 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program.  If not, see <http://www.gnu.org/licenses/>. 
+ *
+ *  You can contact the author at: mark _dot_ vanderloo _at_ gmail _dot_ com
+ *
+ */
+
+
+
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+/* R-windows-oldrel (3.2.x) uses gcc 4.6.3  which we need to detect */
+#ifdef __GNUC__
+#if __GNUC__ <= 4 && __GNUC_MINOR__ <= 6
+#else
+#define HAS_REDUCTION
+#endif
+#endif
+
+
+
+
+#include <math.h>
+#include <R.h>
+#include <Rdefines.h>
+
+#define MAX(X,Y) ((X) > (Y) ? (X) : (Y))
+#define MIN(X,Y) ((X) < (Y) ? (X) : (Y))
+#define RECYCLE(N, K) ((N) + 1L < (K) ? (N) + 1L : 0L )
+
+// determine when something is numerically zero.
+static double EPS = 1e-8;
+// set max nr of threads to use
+static int NTHREAD = 1L;
+
+// recycling in parallel region.
+static inline int recycle(int i, int nthreads, int ni){
+  i += nthreads;
+  if ( i >= ni )
+    i = (nthreads < ni) ? (i - ni) : (i % ni);
+  return i;
+}
+
+
+SEXP R_get_max_threads(){
+  SEXP out = allocVector(INTSXP, 1L);
+  PROTECT(out);
+  INTEGER(out)[0] = 1L;
+  #ifdef _OPENMP
+  INTEGER(out)[0] = omp_get_max_threads();
+  #endif
+  UNPROTECT(1);
+  return out;
+}
+
+
+SEXP R_get_num_procs(){
+  SEXP out = allocVector(INTSXP, 1L);
+  PROTECT(out);
+  INTEGER(out)[0] = 1L;
+  #ifdef _OPENMP
+  INTEGER(out)[0] = omp_get_num_procs();
+  #endif
+  UNPROTECT(1);
+  return out;
+}
+
+SEXP R_get_thread_limit(){
+  SEXP out = allocVector(INTSXP, 1L);
+  PROTECT(out);
+  INTEGER(out)[0] = 1L;
+  #ifdef _OPENMP
+  INTEGER(out)[0] = omp_get_thread_limit();
+  #endif
+  UNPROTECT(1);
+  return out;
+}
+
+
+// presence or absence of a character. x and y are 0 (FALSE) or 1 (TRUE)
+static inline void gower_logi(int *x, int nx, int *y, int ny
+   , double *num, double *den)
+{
+
+  #pragma omp parallel num_threads(NTHREAD) 
+  {
+    int nt = MAX(nx,ny);
+    double dijk, sijk;
+    int i = 0, j = 0;
+    double *inum = num,  *iden=den;
+    
+    int ID = 0, num_threads=1;
+    #ifdef _OPENMP
+    ID = omp_get_thread_num();
+    num_threads = omp_get_num_threads();
+    i = recycle(ID-num_threads, num_threads, nx);
+    j = recycle(ID-num_threads, num_threads, ny);
+    inum += ID;
+    iden += ID;
+    #endif
+    
+
+    for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){
+      dijk = (double) ((x[i] | y[j]) & !((x[i] == NA_INTEGER) | (y[j] == NA_INTEGER)));
+      sijk = (dijk == 1.0) ? (double) (x[i] * y[j]) : 0.0;
+      *inum += dijk * sijk; 
+      *iden += dijk;
+      i = recycle(i, num_threads, nx);
+      j = recycle(j, num_threads, ny);
+    }
+  } // end parallel region.
+}
+
+// equality of categorical variables, encoded as x, y in {1,2,...,N}.
+static inline void gower_cat(int *x, int nx, int *y, int ny 
+  , double *num, double *den)
+{
+
+  #pragma omp parallel num_threads(NTHREAD)
+  {
+    int nt = MAX(nx,ny);
+    double dijk, sijk;
+    int i = 0, j = 0;
+    double *inum = num,  *iden=den;
+    
+    int ID = 0, num_threads=1;
+    #ifdef _OPENMP
+    ID = omp_get_thread_num();
+    num_threads = omp_get_num_threads();
+    i = recycle(ID-num_threads, num_threads, nx);
+    j = recycle(ID-num_threads, num_threads, ny);
+    inum += ID;
+    iden += ID;
+    #endif
+
+    for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){
+      dijk = (double) !(x[i] == NA_INTEGER || y[j] == NA_INTEGER);
+      sijk = (dijk==1.0) ? (double) (x[i] == y[j]) : 0.0; 
+      *inum += dijk * sijk; 
+      *iden += dijk;
+      i = recycle(i, num_threads, nx);
+      j = recycle(j, num_threads, ny);
+    }
+  } // end parallel region.
+
+}
+
+// strings. Treated as categories.
+static inline void gower_str(SEXP x, int nx, SEXP y, int ny, double *num, double *den){
+  #pragma omp parallel num_threads(NTHREAD)
+  {
+    int nt = MAX(nx, ny);
+    double dijk, sijk;
+    int i=0, j=0;
+    double *inum = num,  *iden=den;
+    SEXP xi, yj;
+
+    int ID = 0, num_threads=1;
+    #ifdef _OPENMP
+    ID = omp_get_thread_num();
+    num_threads = omp_get_num_threads();
+    i = recycle(ID-num_threads, num_threads, nx);
+    j = recycle(ID-num_threads, num_threads, ny);
+    inum += ID;
+    iden += ID;
+    #endif
+   
+    for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){
+      xi = STRING_ELT(x,i);
+      yj = STRING_ELT(y,j);
+      dijk = (double) !(xi == NA_STRING || yj == NA_STRING);
+      sijk = (dijk==1.0) ? (double) (CHAR(xi) == CHAR(yj)) : 0.0; 
+      *inum += dijk * sijk; 
+      *iden += dijk;
+      i = recycle(i, num_threads, nx);
+      j = recycle(j, num_threads, ny);
+    }
+  } // end of parallel region 
+}
+
+
+// comparison of numerical variables, by absolute difference divided by range.
+static inline void gower_num(double *x, int nx, double *y, int ny,double R
+    , double *num, double *den)
+{
+  if ( !isfinite(R) || R < EPS ){
+    warning("skipping variable with zero or non-finite range.");
+    return;
+  } 
+  #pragma omp parallel num_threads(NTHREAD)
+  {
+    int nt = MAX(nx,ny);
+    double dijk, sijk;
+    int i = 0, j = 0;
+    double *inum = num,  *iden=den;
+    
+    int ID = 0, num_threads=1;
+    #ifdef _OPENMP
+    ID = omp_get_thread_num();
+    num_threads = omp_get_num_threads();
+    i = recycle(ID-num_threads, num_threads, nx);
+    j = recycle(ID-num_threads, num_threads, ny);
+    inum += ID;
+    iden += ID;
+    #endif
+
+
+    for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){
+      dijk = (double) (isfinite(x[i]) & isfinite(y[j]));
+      sijk = (dijk==1.0) ? (1.0-fabs(x[i]-y[j])/R) : 0.0;
+      (*inum) += dijk * sijk; 
+      (*iden) += dijk;
+      i = recycle(i, num_threads, nx);
+      j = recycle(j, num_threads, ny);
+    }
+    
+  } // end of parallel region
+}
+
+
+static inline void gower_dbl_int(double *x, int nx, int *y, int ny,double R
+    , double *num, double *den)
+{
+
+  if ( !isfinite(R) || R < EPS ){
+    warning("skipping variable with zero or non-finite range\n");
+    return;
+  }
+
+  #pragma omp parallel num_threads(NTHREAD)
+  {
+    int nt = MAX(nx, ny);
+    double dijk, sijk;
+    int i=0, j=0;
+    double *inum = num,  *iden=den;
+
+    int ID = 0, num_threads=1;
+    #ifdef _OPENMP
+    ID = omp_get_thread_num();
+    num_threads = omp_get_num_threads();
+    i = recycle(ID-num_threads, num_threads, nx);
+    j = recycle(ID-num_threads, num_threads, ny);
+    inum += ID;
+    iden += ID;
+    #endif
+   
+    for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){
+      dijk = (double) (isfinite(x[i]) & (y[j] != NA_INTEGER));
+      sijk = (dijk==1.0) ? (1.0-fabs(x[i] - ((double) y[j]) )/R) : 0.0;
+      *inum += dijk * sijk; 
+      *iden += dijk;
+      i = recycle(i, num_threads, nx);
+      j = recycle(j, num_threads, ny);
+    }
+  } // end of parallel region
+}
+
+static inline void gower_int(int *x, int nx, int *y, int ny, double R
+    , double *num, double *den)
+{
+  if ( !isfinite(R) || R == 0 ){
+    warning("skipping variable with zero or non-finite range\n");
+    return;
+  }
+  #pragma omp parallel num_threads(NTHREAD)
+  {
+    int nt = MAX(nx, ny);
+    double dijk, sijk;
+    int i=0, j=0;
+    double *inum = num,  *iden=den;
+
+    int ID = 0, num_threads=1;
+    #ifdef _OPENMP
+    ID = omp_get_thread_num();
+    num_threads = omp_get_num_threads();
+    i = recycle(ID-num_threads, num_threads, nx);
+    j = recycle(ID-num_threads, num_threads, ny);
+    inum += ID;
+    iden += ID;
+    #endif
+
+
+    for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){
+      dijk = (double) ( (x[i] !=NA_INTEGER) & (y[j] != NA_INTEGER));
+      sijk = (dijk==1.0) ? (1.0-fabs( ((double)x[i]) - ((double)y[j]) )/R) : 0.0;
+      *inum += dijk * sijk; 
+      *iden += dijk;
+      i = recycle(i, num_threads, nx);
+      j = recycle(j, num_threads, ny);
+    }
+  } // end of parallel region
+}
+
+// range computations
+static void get_dbl_range(double *x, int nx, double *min, double *max){
+
+  double *ix = x;
+
+
+  double imin=*ix, imax=*ix;
+
+  for ( int i=0; i<nx; i++, ix++ ){
+    imin = *ix; 
+    imax = *ix;
+    if (isfinite(*ix)) break;
+  }
+  
+  // all non-finite, range not computable.
+  if ( !isfinite(imin) ){
+    return ;
+  }
+
+  #ifdef HAS_REDUCTION
+  #pragma omp parallel num_threads(NTHREAD)
+  #endif
+  {
+    #ifdef HAS_REDUCTION
+    #pragma omp for reduction(min:imin), reduction(max:imax)
+    #endif
+    for ( int i=0; i<nx; i++){
+      if (isfinite(x[i])){
+        if (x[i] > imax){
+          imax = x[i];
+        } else if ( x[i] < imin ){
+          imin = x[i];
+        }
+      }
+    }
+  }// end parallel region
+  *min = imin;
+  *max = imax;
+}
+
+
+static void get_int_range(int *x, int nx, double *min, double *max){
+
+  int *ix = x;
+
+  int imin = x[0]
+    , imax = x[0];
+
+  for ( int i=0; i<nx; i++, ix++ ){
+    if ( imin != NA_INTEGER ) break;
+    imin = *ix; 
+    imax = *ix;
+  }
+  
+  // all missing, range not computable.
+  if ( imin == NA_INTEGER ){
+    return ;
+  }
+
+   
+  #ifdef HAS_REDUCTION
+  #pragma omp parallel for reduction(min:imin), reduction(max:imax)
+  #endif
+  for ( int i=0; i<nx; i++){
+    if ( x[i] != NA_INTEGER ){
+      if (x[i] > imax){
+        imax = x[i];
+      } else if ( x[i] < imin ){
+        imin = x[i];
+      }
+    }
+  }
+  *min = (double) imin;
+  *max = (double) imax;
+}
+
+static void get_range(SEXP x, double *min, double *max){
+
+  switch( TYPEOF(x) ){
+    case INTSXP : {
+      get_int_range(INTEGER(x), length(x), min, max);
+      break;
+    }
+    case REALSXP : {
+      get_dbl_range(REAL(x), length(x), min, max);
+      break;
+    }
+  }
+
+}
+
+static double get_xy_range(SEXP x, SEXP y){
+
+  double x_min = R_NegInf
+      , x_max = R_PosInf
+      , y_min = R_NegInf
+      , y_max = R_PosInf
+      , min, max;
+
+  get_range(x, &x_min, &x_max);
+  get_range(y, &y_min, &y_max);
+
+  if ( isfinite(x_min) & isfinite(y_min) ){
+    min = MIN(x_min, y_min);
+  } else if ( isfinite(x_min) & !(isfinite(y_min)) ){
+    min = x_min;
+  } else if ( (!isfinite(x_min)) & isfinite(y_min) ) {
+    min = y_min;
+  } else {
+    min = NA_REAL;
+  }
+
+  if ( isfinite(x_max) & isfinite(y_min) ){
+    max = MAX(x_max, y_max);
+  } else if ( isfinite(x_max) & !isfinite(y_max) ){
+    max = x_max;
+  } else if ( (!isfinite(x_max)) & isfinite(y_max) ){
+    max = y_max;
+  } else {
+    max = NA_REAL;
+  }
+
+  return max - min;
+
+}
+
+SEXP R_get_xy_range(SEXP x_, SEXP y_, SEXP nthread_){
+  NTHREAD = INTEGER(nthread_)[0];
+  SEXP out = allocVector(REALSXP,1L);
+  PROTECT(out);
+  REAL(out)[0] = get_xy_range(x_, y_);
+  UNPROTECT(1);
+  return out;
+}
+
+
+/*
+static void print_vec(double *x, int n){
+  for ( int i=0; i<n; i++){
+    Rprintf("(%d) = %4.3f, ",i,x[i]);
+  }
+  Rprintf("\n");
+}*/
+
+static void do_gower(
+  SEXP x               // a data.frame
+  , SEXP y             // another data.frame
+  , SEXP ranges_       // dbl; ranges[i] is range of variable (x[i],y[pair[i]])
+  , SEXP pair_         // int; pair[i] is index in y
+  , SEXP factor_pair_  // int; 0 if not factor
+  , SEXP eps_          // dbl; numerical zero
+  , SEXP nthread_      // int; requested nr of threads
+  , double *work       // dbl; of length max(nrow(x),nrow(y))
+  , SEXP out_)         // dbl; output, length equals work.
+{
+
+  int *pair = INTEGER(pair_)
+    , *factor_pair = INTEGER(factor_pair_);
+  int npair = length(pair_);
+
+  double *ranges = REAL(ranges_);
+
+  NTHREAD = INTEGER(nthread_)[0];
+
+  // set global epsilon
+  EPS = REAL(eps_)[0];
+
+
+  int nrow_x = length(VECTOR_ELT(x, 0L))
+    , nrow_y = length(VECTOR_ELT(y, 0L));
+  int nt = MAX(nrow_x, nrow_y);
+
+  // numerator & denominator. 
+  double *num = REAL(out_)
+       , *den = work;
+
+  // initialize work- and output space
+  double *iden = den, *inum = num;
+  for ( int j=0; j<nt; j++, iden++, inum++){
+    *iden = 0.0;
+    *inum = 0.0;
+  }
+
+  int type_y;
+  double R;
+
+  // loop over columns of x, compare with paired columns in y.
+  for ( int j = 0; j < npair; j++){
+    if (pair[j] == -1L) continue; // no paired column.
+    switch( TYPEOF(VECTOR_ELT(x,j)) ) {
+      case LGLSXP : 
+        gower_logi(INTEGER(VECTOR_ELT(x,j)), nrow_x
+            , INTEGER(VECTOR_ELT(y,pair[j])), nrow_y
+            ,num, den);
+        break;
+      case REALSXP : 
+        R = ranges[j];
+        if (TYPEOF(VECTOR_ELT(y,pair[j])) == REALSXP){
+          gower_num(REAL(VECTOR_ELT(x,j)), nrow_x
+                , REAL(VECTOR_ELT(y,pair[j])), nrow_y
+                , R, num, den);
+        } else if (TYPEOF(VECTOR_ELT(y,pair[j])) == INTSXP) {
+          gower_dbl_int(REAL(VECTOR_ELT(x,j)), nrow_x
+                , INTEGER(VECTOR_ELT(y,pair[j])), nrow_y
+                , R, num, den);
+        }
+        break;
+      case INTSXP : 
+        type_y = TYPEOF(VECTOR_ELT(y,pair[j]));
+        if ( type_y == REALSXP ){ // treat as numeric
+          R = ranges[j];
+          gower_dbl_int(REAL(VECTOR_ELT(y,pair[j])), nrow_y
+                , INTEGER(VECTOR_ELT(x, j)), nrow_x
+                , R, num, den);
+        } else if ( type_y == INTSXP ){
+          if ( factor_pair[j] ){ // factor variables
+            gower_cat(INTEGER(VECTOR_ELT(x,j)), nrow_x
+                    , INTEGER(VECTOR_ELT(y,pair[j])), nrow_y
+                    , num, den);
+          } else { // treat as integers
+            R = ranges[j];
+            gower_int(INTEGER(VECTOR_ELT(x,j)), nrow_x
+                    , INTEGER(VECTOR_ELT(y,pair[j])), nrow_y
+                    , R, num, den);
+          }
+        } 
+        break;
+      case STRSXP :
+        gower_str(VECTOR_ELT(x,j),nrow_x,VECTOR_ELT(y,pair[j]),nrow_y, num, den);
+    } // end switch
+  } // end for
+
+  inum = num;
+  iden = den;
+  for (int i=0; i<nt; i++, inum++, iden++){
+    (*inum) = (*iden == 0.0) ? R_NaN : (1.0 - (*inum)/(*iden));
+  }
+
+}
+
+
+
+SEXP R_gower(SEXP x, SEXP y, SEXP ranges_, SEXP pair_
+    , SEXP factor_pair_, SEXP eps_, SEXP nthread_){
+
+
+  int nrow_x = length(VECTOR_ELT(x, 0L))
+    , nrow_y = length(VECTOR_ELT(y, 0L));
+  int nt = MAX(nrow_x, nrow_y);
+
+  // Setup space for output.
+  SEXP out_;
+  out_ = PROTECT(allocVector(REALSXP, nt));
+  // Make room for the workhorse
+  double *work = (double *) R_alloc(nt, sizeof(double));
+  // Make the horse work
+  do_gower(x, y, ranges_, pair_, factor_pair_, eps_, nthread_, work, out_);
+
+  // cleanup and return
+  UNPROTECT(1);
+  return(out_);
+
+}
+
+
+
+/* Push a value and an index down the list while keeping things sorted.
+ * 
+ * x    : value to push down
+ * ind  : index value
+ * topn : list, initialized with 1./0.
+ * index: list, initialized with 0L
+ * n    : nr of values in the list.
+ */
+static inline void push(double x, int ind, double *topn, int *index, int n){
+  
+  for ( int i=0; i<n; i++){
+    if ( (x < topn[i]) | ((x == topn[i]) & (i < n-1))){
+      // new entry encountered, bubble to keep sorted.
+      if ( topn[i] == R_PosInf ){ // No entry yet, overwrite.
+        topn[i] = x;
+        index[i] = ind;
+        break; // out of main loop
+      } else { // Bubble up to insert new entry
+        topn[n-1] = topn[n-2];
+        index[n-1] = index[n-2];
+        for ( int j=n-2; j>i; j-- ){
+          topn[j] = topn[j-1];
+          index[j] = index[j-1];
+        }
+        topn[i] = x;
+        index[i] = ind;
+        break; // out of main loop
+      }
+    }
+  }
+
+} 
+
+/* For testing purposes
+SEXP R_pushdown(SEXP entry_, SEXP index_, SEXP values_, SEXP indices_){
+  double  entry   = REAL(entry_)[0];
+  int     index   = INTEGER(index_)[0];
+  double *values  = REAL(values_);
+  int    *indices = INTEGER(indices_);
+  int n = length(values_);
+  
+  push(entry, index, values, indices, n);
+  return R_NilValue;
+}
+*/
+
+
+
+static inline void copyrec(SEXP into, SEXP from, int i){
+  int ncol = length(into);
+
+  SEXP col_from, col_into;
+
+  for ( int j = 0; j < ncol; j++){
+    col_from = VECTOR_ELT(from,j);
+    col_into = VECTOR_ELT(into,j);
+    switch(TYPEOF(col_from)){
+      case LGLSXP  : { INTEGER(col_into)[0] = INTEGER(col_from)[i]; break;}
+      case INTSXP  : { INTEGER(col_into)[0] = INTEGER(col_from)[i]; break;}
+      case REALSXP : { REAL(col_into)[0] = REAL(col_from)[i]; break;}
+      case STRSXP  : { SET_STRING_ELT(col_from, 0, STRING_ELT(col_from,i)); break;}
+    }
+  }
+}
+
+/* for testing purposes only
+void prvec(SEXP x){
+  for (int i=0; i<length(x); i++){
+    Rprintf("%8.4f",REAL(x)[i]);
+  }
+Rprintf("\n");
+}
+*/
+
+SEXP R_gower_topn(SEXP x_, SEXP y_, SEXP ranges_, SEXP pair_
+    , SEXP factor_pair_, SEXP n_, SEXP eps_, SEXP nthread_){
+
+  int n = INTEGER(n_)[0];
+  int ny = length(VECTOR_ELT(y_,0));
+  int nrowx = length(VECTOR_ELT(x_,0)); 
+  int nout = nrowx * n;
+  int nt = MAX(ny,nrowx);
+
+  // setup output list
+  SEXP out = allocVector(VECSXP, 2L);
+  PROTECT(out);
+  SET_VECTOR_ELT(out, 0L, allocVector(INTSXP, nout));
+  SET_VECTOR_ELT(out, 1L, allocVector(REALSXP, nout));
+
+  // temporary record to pass to R_gower
+  SEXP temprec_ = allocVector(VECSXP, length(x_));
+  PROTECT(temprec_);
+
+  // Room for do_gower workhorse
+  double *work = (double *) R_alloc(nt, sizeof(double));
+ 
+  // initialize output distance.
+  double *vv=REAL(VECTOR_ELT(out,1L));
+  int *ii=INTEGER(VECTOR_ELT(out,0L));
+  #pragma omp parallel num_threads(NTHREAD) 
+  { 
+    #pragma omp for schedule(static)
+    for(int i=0; i<nout; i++){
+      vv[i] = R_PosInf;
+      ii[i] = 0L;
+    }
+  }
+
+
+  // pointers to output
+  int *index = INTEGER(VECTOR_ELT(out, 0L));
+  double *value = REAL(VECTOR_ELT(out, 1L));
+
+  // initialize spots in temporary record
+  for ( int j=0; j<length(x_); j++){
+    SET_VECTOR_ELT(temprec_, j, allocVector(TYPEOF(VECTOR_ELT(x_,j)), 1L));
+  }
+
+  // temporarily stores distance values.
+  SEXP d_ = allocVector(REALSXP,nt);
+  PROTECT(d_);
+  double *dist; 
+  // loop over records in x_
+  for ( int i=0; i < nrowx; i++){
+    // create a list to feed to R_gower
+    copyrec(temprec_, x_, i);
+    // compute distances
+    do_gower(temprec_, y_, ranges_, pair_, factor_pair_, eps_, nthread_, work, d_);
+    // push down distances & indices.
+    dist = REAL(d_);
+    for ( int k=0; k < ny; k++, dist++){
+      push(*dist, k+1, value, index, n);
+    }
+    value += n;
+    index += n;
+  }
+
+
+  // return list with indices and distances.
+  UNPROTECT(3);
+  return(out);
+}
+
+
+
+
+
+
diff --git a/tests/testthat.R b/tests/testthat.R
new file mode 100644
index 0000000..9d76de0
--- /dev/null
+++ b/tests/testthat.R
@@ -0,0 +1,3 @@
+if ( require(testthat) ) test_check("gower")
+
+
diff --git a/tests/testthat/test_gower.R b/tests/testthat/test_gower.R
new file mode 100644
index 0000000..145683c
--- /dev/null
+++ b/tests/testthat/test_gower.R
@@ -0,0 +1,118 @@
+
+context("Basic distance elements")
+
+test_that("distance between logicals",{
+  dL <- expand.grid(c(TRUE,FALSE),c(TRUE,FALSE))
+  x = data.frame(x=dL[,1])
+  y = data.frame(x=dL[,2])
+  expect_equal(gower_dist(x = x,y = y),c(0,1,1,NaN))
+})
+
+test_that("distance between factor variables",{
+  bands <- c("Grand Magus","Skull Fist")
+  dF <- expand.grid(bands,bands)
+  expect_equal(gower_dist(data.frame(x=dF[,1]),data.frame(x=dF[,2])),c(0,1,1,0))
+})
+
+test_that("distance between numerical variables",{
+  dN <- data.frame(x = as.numeric(1:4),y=as.numeric(c(1,1,2,3)))
+  expect_equal(gower_dist(data.frame(x=dN[,1]),data.frame(x=dN[,2])),c(0,1/3,1/3,1/3))
+})
+
+test_that("distance between character variables",{
+  dC <- data.frame(x=letters[1:3],y=letters[3:1],stringsAsFactors=FALSE)
+  expect_equal(gower_dist( 
+    data.frame(x=dC[,1],stringsAsFactors=FALSE)
+    , data.frame(x=dC[,2],stringsAsFactors=FALSE)),c(1,0,1))
+
+})
+
+test_that("multivariate dataset",{
+  bands <- c("Grand Magus","Skull Fist")
+  dL <- expand.grid(c(TRUE,FALSE),c(TRUE,FALSE))
+  dN <- data.frame(x = as.numeric(1:4),y=as.numeric(c(1,1,2,3)))
+  dF <- expand.grid(bands,bands)
+  dM1 <- data.frame(x=dL[,1],y=dF[,1],z=dN[,1])  
+  dM2 <- data.frame(x=dL[,2],y=dF[,2],z=dN[,2])
+  expect_equal(gower_dist(x=dM1,y=dM2), c(0,7/9,7/9,1/6))
+  # check symmetry
+  expect_equal(gower_dist(dM1,dM2),gower_dist(dM2,dM1))
+  # not counting NA's in the denominator
+  dM1[array(c(2,3,4,1,2,3),dim=c(3,2))] <- NA
+  expect_equal(gower_dist(dM1,dM2), c(0,3/4,3/4,0))
+})
+
+test_that("recycling",{
+  expect_equal(length(gower_dist(x=iris[1,],y=iris)), nrow(iris))
+  expect_equal(length(gower_dist(x=iris,y=iris[1,])), nrow(iris))
+  expect_equal(length(gower_dist(x=iris[1:3,],y=iris)), nrow(iris))
+  expect_equal(length(gower_dist(x=iris,y=iris[1:3,])), nrow(iris))
+})
+
+
+test_that("exceptions",{
+  expect_warning(gower_dist(
+    x = data.frame(x=c(1.2,1.2,1.2))
+    , y = data.frame(x=c(1.2,1.2,1.2))
+  ))
+  expect_warning(gower_dist(
+    x = data.frame(x=c(1.2,1.2,1.2))
+    , y = data.frame(x=c(1.2,1.2,1.3))
+    , eps=0.2
+  ))
+
+  expect_warning(gower_dist(data.frame(x=rep(1,100)), data.frame(x=1,100)))
+})
+
+
+context("Top-n")
+
+test_that("gower_topn",{
+  d1 <- iris[1:3,]
+  d2 <- iris[1:7,]
+  L <- gower_topn(d1,d2,n=4)
+  expect_equal(length(L),2)
+  expect_equal(dim(L[[1]]),c(4,3))
+  expect_equal(dim(L[[1]]),dim(L[[2]]))
+  expect_equal(L[[1]][1,],1:3)
+  expect_equal(L[[2]][1,],rep(0,3))
+  
+  # case where n exceeds nr of records in the lookup table.
+  L <- gower_topn(d1,d2,n=8)
+  expect_equal(L[[1]][8,],rep(0,3))
+  expect_equal(L[[2]][8,],rep(Inf,3))
+  
+})
+
+
+test_that("just to get code-coverage right",{
+  dat1 <- data.frame(
+    x = as.factor(sample(letters,2000,replace=TRUE))
+    ,y = sample(LETTERS,2000,replace=TRUE)
+    ,z = as.integer(1:2000)
+    ,w = sample(c(TRUE,FALSE),2000,replace=TRUE)
+    , stringsAsFactors=FALSE
+  )
+  i <- sample(2000)
+  dat2 <- dat1[i,]
+  gower_dist(dat1,dat2)
+})
+
+
+test_that("warning on recycling",{
+  expect_warning(gower_dist(iris[1:3,],iris[1:2,]))
+  expect_warning(gower_dist(iris[1:2,],iris[1:3,]))
+})
+
+context("regression tests")
+test_that("topn w/n=1",{
+  dat <- data.frame(x = c(NA,2,4,5), y = c(6,7,NA,10))
+  L <- gower_topn(dat[c(1,3),],dat[c(2,4),],n=1)
+  expect_equivalent(as.vector(L$index),c(1,2))
+})
+
+
+
+
+
+
diff --git a/vignettes/intro.Rmd b/vignettes/intro.Rmd
new file mode 100644
index 0000000..beeb0ae
--- /dev/null
+++ b/vignettes/intro.Rmd
@@ -0,0 +1,78 @@
+---
+title: "Introduction to the gower package"
+author: "Mark van der Loo"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Introduction to the gower package}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+## Gower's distance measure
+
+Gower's distance can be used to measure how different two records are. The records may contain combinations of logical, numerical, categorical or text data. The distance is always a number between 0 (identical) and 1 
+(maximally dissimilar). An easy to read specification of the measure is given in the original paper.
+
+Gower (1971) [A general coefficient of similarity and some of its properties.](http://venus.unive.it/romanaz/modstat_ba/gowdis.pdf)  _Biometrics_ **27** 857-874.
+
+In short, Gower's distance (or similarity) first computes distances between pairs of variables over two data sets and then combines those distances to a single value per record-pair.
+
+This package modifies Gower's original similarity measure in the following ways.
+
+- In stead of the original similarity _S_, the package returns the distance _1-S_.
+- The original paper does not mention the concept of `NA`. Missing variables are skipped when computing the distance.
+- The original paper does not mention character data. These are treated as categorical data.
+
+## Computing Gower's distance
+
+The function ```gower_dist``` computes pairwise-distances between records.
+
+```{r}
+library(gower)
+dat1 <- iris[1:10,]
+dat2 <- iris[6:15,]
+gower_dist(dat1, dat2)
+```
+If one data frame has less records than the other, the shortest one is recycled over (just like when you're adding two vectors of unequal length)
+
+```{r}
+gower_dist(iris[1,], dat1)
+```
+
+It is possible to control how columns from the two data sets are paired for comparison using the `pair_x` and `pair_y` arguments. This comes in handy when similar colums have different names accross datasets.  By default, columns with matching names are paired. The behaviour is somewhat similar to that of base R's `merge` in that respect.
+
+```{r}
+dat1 <- dat2 <- iris[1:10,]
+names(dat2) <- tolower(names(dat2))
+gower_dist(dat1, dat2)
+# tell gower_dist to match columns 1..5 in dat1 with column 1..5 in dat2
+gower_dist(dat1, dat2, pair_y=1:5)
+```
+
+## Computing the top-n matches
+
+The function `gower_topn` returns a list with two arrays.
+```{r}
+dat1 <- iris[1:10,]
+L <- gower_topn(x=dat1, y=iris, n=3)
+L
+```
+
+The first array is called `index`. Each column corresponds to one row of `x`. The entries of each column
+index the top _n_ best matches of that row in x with rows in `y`. In this example, the best match of the first row of `dat1` is record number ```r L$index[1,1]``` from `iris` (this should be obvious, since they are the same record). The second best match is record number ```r L$index[2,1]``` from `iris`.
+ 
+The second array is called `distance` and it contains the corresponding distances.
+
+
+
+## Parallelization, memory usage
+
+
+The underlying algorithm is implemented in C and parallelized using [OpenMP](http://www.openmp.org). OpenMP is available on most systems that can run R. Please see [this section](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support) of the writing R extensions manual for up-to-details details on which systems are supported. At the time of writing (summer 2016), OSX is the only system not supporting OpenMP out of the box. You can still make it work by installing the [...]
+
+If OpenMP is not supported, the package will still work but the core algorithms will not be parallelized.
+
+This implementation makes no copies of the data in memory. When computing `gower_dist`, two double precision
+arrays of size _max(nrow(x),nrow(y))_ are kept in memory to store intermediate results. When computing the top-n matches, for _k_ cores, _k+2_ double precision arrays of length ```nrow(y)``` are created to store intermediate results at C level.
+

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/r-cran-gower.git



More information about the debian-science-commits mailing list