[mlpack] 33/53: Add approximate furthest neighbor search tutorial.

Barak A. Pearlmutter barak+git at pearlmutter.net
Mon Nov 14 00:46:49 UTC 2016


This is an automated email from the git hooks/post-receive script.

bap pushed a commit to branch master
in repository mlpack.

commit 5ca6936f03af839e3bf58752c1b7125fc6e20960
Author: Ryan Curtin <ryan at ratml.org>
Date:   Sun Oct 30 20:50:52 2016 +0900

    Add approximate furthest neighbor search tutorial.
---
 doc/tutorials/approx_kfn/approx_kfn.txt | 1025 +++++++++++++++++++++++++++++++
 1 file changed, 1025 insertions(+)

diff --git a/doc/tutorials/approx_kfn/approx_kfn.txt b/doc/tutorials/approx_kfn/approx_kfn.txt
new file mode 100644
index 0000000..aa477a0
--- /dev/null
+++ b/doc/tutorials/approx_kfn/approx_kfn.txt
@@ -0,0 +1,1025 @@
+/*!
+
+ at file approx_kfn.txt
+ at author Ryan Curtin
+ at brief Tutorial for how to use approximate furthest neighbor search in mlpack.
+
+ at page akfntutorial Approximate furthest neighbor search (mlpack_approx_kfn) tutorial
+
+ at section intro_akfntut Introduction
+
+\b mlpack implements multiple strategies for approximate furthest neighbor
+search in its \c mlpack_approx_kfn and \c mlpack_kfn programs (each program
+corresponds to different techniques).  This tutorial discusses what problems
+these algorithms solve and how to use each of the techniques that \b mlpack
+implements.
+
+\b mlpack implements five approximate furthest neighbor search algorithms:
+
+ - brute-force search (in \c mlpack_kfn)
+ - single-tree search (in \c mlpack_kfn)
+ - dual-tree search (in \c mlpack_kfn)
+ - query-dependent approximate furthest neighbor (QDAFN) (in \c mlpack_approx_kfn)
+ - DrusillaSelect (in \c mlpack_approx_kfn)
+
+These methods are described in the following papers:
+
+ at code
+ at inproceedings{curtin2013tree,
+  title={Tree-Independent Dual-Tree Algorithms},
+  author={Curtin, Ryan R. and March, William B. and Ram, Parikshit and Anderson,
+      David V. and Gray, Alexander G. and Isbell Jr., Charles L.},
+  booktitle={Proceedings of The 30th International Conference on Machine
+      Learning (ICML '13)},
+  pages={1435--1443},
+  year={2013}
+}
+ at endcode
+
+ at code
+ at incollection{pagh2015approximate,
+  title={Approximate furthest neighbor in high dimensions},
+  author={Pagh, Rasmus and Silvestri, Francesco and Sivertsen, Johan and Skala,
+      Matthew},
+  booktitle={Similarity Search and Applications},
+  pages={3--14},
+  year={2015},
+  publisher={Springer}
+}
+ at endcode
+
+ at code
+ at incollection{curtin2016fast,
+  title={Fast approximate furthest neighbors with data-dependent candidate
+      selection},
+  author={Curtin, Ryan R., and Gardner, Andrew B.},
+  booktitle={Similarity Search and Applications},
+  pages={221--235},
+  year={2016},
+  publisher={Springer}
+}
+ at endcode
+
+The problem of furthest neighbor search is simple, and is the opposite of the
+much-more-studied nearest neighbor search problem.  Given a set of reference
+points \f$R\f$ (the set in which we are searching), and a set of query points
+\f$Q\f$ (the set of points for which we want the furthest neighbor), our goal is
+to return the \f$k\f$ furthest neighbors for each query point in \f$Q\f$:
+
+\f[
+\operatorname{k-argmax}_{p_r \in R} d(p_q, p_r).
+\f]
+
+In order to solve this problem, \b mlpack provides a number of interfaces.
+
+ - two \ref cli_akfntut "simple command-line executables" to calculate
+   approximate furthest neighbors
+ - a simple \ref cpp_qdafn_akfntut "C++ class for QDAFN"
+ - a simple \ref cpp_ds_akfntut "C++ class for DrusillaSelect"
+ - a simple \ref cpp_kfn_akfntut "C++ class for tree-based and brute-force"
+   search
+
+ at section toc_akfntut Table of Contents
+
+A list of all the sections this tutorial contains.
+
+ - \ref intro_akfntut
+ - \ref toc_akfntut
+ - \ref which_akfntut
+ - \ref cli_akfntut
+   - \ref cli_ex1_akfntut
+   - \ref cli_ex2_akfntut
+   - \ref cli_ex3_akfntut
+   - \ref cli_ex4_akfntut
+   - \ref cli_ex5_akfntut
+   - \ref cli_ex6_akfntut
+   - \ref cli_ex7_akfntut
+   - \ref cli_ex8_akfntut
+   - \ref cli_final_akfntut
+ - \ref cpp_ds_akfntut
+   - \ref cpp_ex1_ds_akfntut
+   - \ref cpp_ex2_ds_akfntut
+   - \ref cpp_ex3_ds_akfntut
+   - \ref cpp_ex4_ds_akfntut
+   - \ref cpp_ex5_ds_akfntut
+ - \ref cpp_qdafn_akfntut
+   - \ref cpp_ex1_qdafn_akfntut
+   - \ref cpp_ex2_qdafn_akfntut
+   - \ref cpp_ex3_qdafn_akfntut
+   - \ref cpp_ex4_qdafn_akfntut
+   - \ref cpp_ex5_qdafn_akfntut
+ - \ref cpp_ns_akfntut
+   - \ref cpp_ex1_ns_akfntut
+   - \ref cpp_ex2_ns_akfntut
+   - \ref cpp_ex3_ns_akfntut
+   - \ref cpp_ex4_ns_akfntut
+ - \ref further_doc_akfntut
+
+ at section which_akfntut Which algorithm should be used?
+
+There are three algorithms for furthest neighbor search that \b mlpack
+implements, and each is suited to a different setting.  Below is some basic
+guidance on what should be used.  Note that the question of "which algorithm
+should be used" is a very difficult question to answer, so the guidance below is
+just that---guidance---and may not be right for a particular problem.
+
+ - \c DrusillaSelect is very fast and will perform extremely well for datasets
+   with outliers or datasets with structure (like low-dimensional datasets
+   embedded in high dimensions)
+ - \c QDAFN is a random approach and therefore should be well-suited for
+   datasets with little to no structure
+ - The tree-based approaches (the \c KFN class and the \c mlpack_kfn program) is
+   best suited for low-dimensional datasets, and is most effective when very
+   small levels of approximation are desired, or when exact results are desired.
+ - Dual-tree search is most useful when the query set is large and structured
+   (like for all-furthest-neighbor search).
+ - Single-tree search is more useful when the query set is small.
+
+ at section cli_akfntut Command-line 'mlpack_approx_kfn' and 'mlpack_kfn'
+
+\b mlpack provides two command-line programs to solve approximate furthest
+neighbor search:
+
+ - \c mlpack_approx_kfn, for the QDAFN and DrusillaSelect approaches
+ - \c mlpack_kfn, for exact and approximate tree-based approaches
+
+These two programs allow a large number of algorithms to be used to find
+approximate furthest neighbors.  Note that the \c mlpack_kfn program is also 
+documented by the \ref cli_nstut section of the \ref nstutorial page, as it
+shares options with the \c mlpack_knn program.
+
+Below are several examples of how the \c mlpack_approx_kfn and \c mlpack_kfn
+programs might be used.  The first examples focus on the \c mlpack_approx_kfn
+program, and the last few show how \c mlpack_kfn can be used to produce
+approximate results.
+
+ at subsection cli_ex1_akfntut Calculate 5 furthest neighbors with default options
+
+Here we have a query dataset \c queries.csv and a reference dataset \c refs.csv
+and we wish to find the 5 furthest neighbors of every query point in the
+reference dataset.  We may do that with the \c mlpack_approx_kfn algorithm,
+using the default of the \c DrusillaSelect algorithm with default parameters.
+
+ at code
+$ mlpack_approx_kfn -q queries.csv -r refs.csv -v -k 5 -n n.csv -d d.csv
+[INFO ] Loading 'refs.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Building DrusillaSelect model...
+[INFO ] Model built.
+[INFO ] Loading 'queries.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Searching for 5 furthest neighbors with DrusillaSelect...
+[INFO ] Search complete.
+[INFO ] Saving CSV data to 'n.csv'.
+[INFO ] Saving CSV data to 'd.csv'.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: ds
+[INFO ]   calculate_error: false
+[INFO ]   distances_file: d.csv
+[INFO ]   exact_distances_file: ""
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: ""
+[INFO ]   k: 5
+[INFO ]   neighbors_file: n.csv
+[INFO ]   num_projections: 5
+[INFO ]   num_tables: 5
+[INFO ]   output_model_file: ""
+[INFO ]   query_file: queries.csv
+[INFO ]   reference_file: refs.csv
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   drusilla_select_construct: 0.000342s
+[INFO ]   drusilla_select_search: 0.000780s
+[INFO ]   loading_data: 0.010689s
+[INFO ]   saving_data: 0.005585s
+[INFO ]   total_time: 0.018592s
+ at endcode
+
+Convenient timers for parts of the program operation are printed.  The results,
+saved in \c n.csv and \c d.csv, indicate the furthest neighbors and distances
+for each query point.  The row of the output file indicates the query point that
+the results are for.  The neighbors are listed from furthest to nearest; so, the
+4th element in the 3rd row of \c d.csv indicates the distance between the 3rd
+query point in \c queries.csv and its approximate 4th furthest neighbor.
+Similarly, the same element in \c n.csv indicates the index of the approximate
+4th furthest neighbor (with respect to \c refs.csv).
+
+ at subsection cli_ex2_akfntut Specifying algorithm parameters for DrusillaSelect
+
+The \c -p (\c --num_projections) and \c -t (\c --num_tables) parameters affect
+the running of the \c DrusillaSelect algorithm and the QDAFN algorithm.
+Specifically, larger values for each of these parameters will search more
+possible candidate furthest neighbors and produce better results (at the cost of
+runtime).  More details on how each of these parameters works is available in
+the original papers, the \b mlpack source, or the documentation given by
+\c --help.
+
+In the example below, we run \c DrusillaSelect to find 4 furthest neighbors
+using 10 tables and 2 points in each table.  In this case we have chosen to omit
+the \c -n \c n.csv option, meaning that only the output candidate distances will
+be written to \c d.csv.
+
+ at code
+$ mlpack_approx_kfn -q queries.csv -r refs.csv -v -k 4 -n n.csv -d d.csv -t 10 -p 2
+[INFO ] Loading 'refs.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Building DrusillaSelect model...
+[INFO ] Model built.
+[INFO ] Loading 'queries.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Searching for 4 furthest neighbors with DrusillaSelect...
+[INFO ] Search complete.
+[INFO ] Saving CSV data to 'n.csv'.
+[INFO ] Saving CSV data to 'd.csv'.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: ds
+[INFO ]   calculate_error: false
+[INFO ]   distances_file: d.csv
+[INFO ]   exact_distances_file: ""
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: ""
+[INFO ]   k: 4
+[INFO ]   neighbors_file: n.csv
+[INFO ]   num_projections: 2
+[INFO ]   num_tables: 10
+[INFO ]   output_model_file: ""
+[INFO ]   query_file: queries.csv
+[INFO ]   reference_file: refs.csv
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   drusilla_select_construct: 0.000645s
+[INFO ]   drusilla_select_search: 0.000551s
+[INFO ]   loading_data: 0.008518s
+[INFO ]   saving_data: 0.003734s
+[INFO ]   total_time: 0.014019s
+ at endcode
+
+ at subsection cli_ex3_akfntut Using QDAFN instead of DrusillaSelect
+
+The algorithm to be used for approximate furthest neighbor search can be
+specified with the \c --algorithm (\c -a) option to the \c mlpack_approx_kfn
+program.  Below, we use the QDAFN algorithm instead of the default.  We leave
+the \c -p and \c -t options at their defaults---even though QDAFN often requires
+more tables and points to get the same quality of results.
+
+ at code
+$ mlpack_approx_kfn -q queries.csv -r refs.csv -v -k 3 -n n.csv -d d.csv -a qdafn
+[INFO ] Loading 'refs.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Building QDAFN model...
+[INFO ] Model built.
+[INFO ] Loading 'queries.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Searching for 3 furthest neighbors with QDAFN...
+[INFO ] Search complete.
+[INFO ] Saving CSV data to 'n.csv'.
+[INFO ] Saving CSV data to 'd.csv'.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: qdafn
+[INFO ]   calculate_error: false
+[INFO ]   distances_file: d.csv
+[INFO ]   exact_distances_file: ""
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: ""
+[INFO ]   k: 3
+[INFO ]   neighbors_file: n.csv
+[INFO ]   num_projections: 5
+[INFO ]   num_tables: 5
+[INFO ]   output_model_file: ""
+[INFO ]   query_file: queries.csv
+[INFO ]   reference_file: refs.csv
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   loading_data: 0.008380s
+[INFO ]   qdafn_construct: 0.003399s
+[INFO ]   qdafn_search: 0.000886s
+[INFO ]   saving_data: 0.002253s
+[INFO ]   total_time: 0.015465s
+ at endcode
+
+ at subsection cli_ex4_akfntut Printing results quality with exact distances
+
+The \c mlpack_approx_kfn program can calculate the quality of the results if the
+\c --calculate_error (\c -e) flag is specified.  Below we use the program with
+its default parameters and calculate the error, which is displayed in the
+output.  The error is only calculated for the furthest neighbor, not all k;
+therefore, in this example we have set \c -k to \c 1.
+
+ at code
+$ mlpack_approx_kfn -q queries.csv -r refs.csv -v -k 1 -e -q -n n.csv
+[INFO ] Loading 'refs.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Building DrusillaSelect model...
+[INFO ] Model built.
+[INFO ] Loading 'queries.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Searching for 1 furthest neighbors with DrusillaSelect...
+[INFO ] Search complete.
+[INFO ] Calculating exact distances...
+[INFO ] 28891 node combinations were scored.
+[INFO ] 37735 base cases were calculated.
+[INFO ] Calculation complete.
+[INFO ] Average error: 1.08417.
+[INFO ] Maximum error: 1.28712.
+[INFO ] Minimum error: 1.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: ds
+[INFO ]   calculate_error: true
+[INFO ]   distances_file: ""
+[INFO ]   exact_distances_file: ""
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: ""
+[INFO ]   k: 3
+[INFO ]   neighbors_file: ""
+[INFO ]   num_projections: 5
+[INFO ]   num_tables: 5
+[INFO ]   output_model_file: ""
+[INFO ]   query_file: queries.csv
+[INFO ]   reference_file: refs.csv
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   computing_neighbors: 0.001476s
+[INFO ]   drusilla_select_construct: 0.000309s
+[INFO ]   drusilla_select_search: 0.000495s
+[INFO ]   loading_data: 0.008462s
+[INFO ]   total_time: 0.011670s
+[INFO ]   tree_building: 0.000202s
+ at endcode
+
+Note that the output includes three lines indicating the error:
+
+ at code
+[INFO ] Average error: 1.08417.
+[INFO ] Maximum error: 1.28712.
+[INFO ] Minimum error: 1.
+ at endcode
+
+In this case, a minimum error of 1 indicates an exact result, and over the
+entire query set the algorithm has returned a furthest neighbor candidate with
+maximum error 1.28712.
+
+ at subsection cli_ex5_akfntut Using cached exact distances for quality results
+
+However, for large datasets, calculating the error may take a long time, because
+the exact furthest neighbors must be calculated.  Therefore, if the exact
+furthest neighbor distances are already known, they may be passed in with the
+\c --exact_distances_file (\c -x) option in order to avoid the calculation.  In
+the example below, we assume \c exact.csv contains the exact furthest neighbor
+distances.  We run the \c qdafn algorithm in this example.
+
+Note that the \c -e option must be specified for the \c -x option have any
+effect.
+
+ at code
+$ mlpack_approx_kfn -q queries.csv -r refs.csv -k 1 -e -x exact.csv -n n.csv -v -a qdafn
+[INFO ] Loading 'refs.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Building QDAFN model...
+[INFO ] Model built.
+[INFO ] Loading 'queries.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Searching for 1 furthest neighbors with QDAFN...
+[INFO ] Search complete.
+[INFO ] Loading 'exact.csv' as raw ASCII formatted data.  Size is 1 x 1000.
+[INFO ] Average error: 1.06914.
+[INFO ] Maximum error: 1.67407.
+[INFO ] Minimum error: 1.
+[INFO ] Saving CSV data to 'n.csv'.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: qdafn
+[INFO ]   calculate_error: true
+[INFO ]   distances_file: ""
+[INFO ]   exact_distances_file: exact.csv
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: ""
+[INFO ]   k: 1
+[INFO ]   neighbors_file: n.csv
+[INFO ]   num_projections: 5
+[INFO ]   num_tables: 5
+[INFO ]   output_model_file: ""
+[INFO ]   query_file: queries.csv
+[INFO ]   reference_file: refs.csv
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   loading_data: 0.010348s
+[INFO ]   qdafn_construct: 0.000318s
+[INFO ]   qdafn_search: 0.000793s
+[INFO ]   saving_data: 0.000259s
+[INFO ]   total_time: 0.012254s
+ at endcode
+
+ at subsection cli_ex6_akfntut Using tree-based approximation with mlpack_kfn
+
+The \c mlpack_kfn algorithm allows specifying a desired approximation level with
+the \c --epsilon (\c -e) option.  The parameter must be greater than or equal
+to 0 and less than 1.  A setting of 0 indicates exact search.
+
+The example below runs dual-tree furthest neighbor search (the default
+algorithm) with the approximation parameter set to 0.5.
+
+ at code
+$ mlpack_kfn -q queries.csv -r refs.csv -v -k 3 -e 0.5 -n n.csv -d d.csv
+[INFO ] Loading 'refs.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Loaded reference data from 'refs.csv' (3x1000).
+[INFO ] Building reference tree...
+[INFO ] Tree built.
+[INFO ] Loading 'queries.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Loaded query data from 'queries.csv' (3x1000).
+[INFO ] Searching for 3 neighbors with dual-tree kd-tree search...
+[INFO ] 1611 node combinations were scored.
+[INFO ] 13938 base cases were calculated.
+[INFO ] 1611 node combinations were scored.
+[INFO ] 13938 base cases were calculated.
+[INFO ] Search complete.
+[INFO ] Saving CSV data to 'n.csv'.
+[INFO ] Saving CSV data to 'd.csv'.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: dual_tree
+[INFO ]   distances_file: d.csv
+[INFO ]   epsilon: 0.5
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: ""
+[INFO ]   k: 3
+[INFO ]   leaf_size: 20
+[INFO ]   naive: false
+[INFO ]   neighbors_file: n.csv
+[INFO ]   output_model_file: ""
+[INFO ]   percentage: 1
+[INFO ]   query_file: queries.csv
+[INFO ]   random_basis: false
+[INFO ]   reference_file: refs.csv
+[INFO ]   seed: 0
+[INFO ]   single_mode: false
+[INFO ]   tree_type: kd
+[INFO ]   true_distances_file: ""
+[INFO ]   true_neighbors_file: ""
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   computing_neighbors: 0.000442s
+[INFO ]   loading_data: 0.008060s
+[INFO ]   saving_data: 0.002850s
+[INFO ]   total_time: 0.012667s
+[INFO ]   tree_building: 0.000251s
+ at endcode
+
+Note that the format of the output files \c d.csv and \c n.csv are the same as
+for \c mlpack_approx_kfn.
+
+ at subsection cli_ex7_akfntut Different algorithms with 'mlpack_kfn'
+
+The \c mlpack_kfn program offers a large number of different algorithms that can
+be used.  The \c --algorithm (\c -a) may be used to specify three main different
+algorithm types: \c naive (brute-force search), \c single_tree (single-tree
+search), \c dual_tree (dual-tree search, the default), and \c greedy
+("defeatist" greedy search, which goes to one leaf node of the tree then
+terminates).  The example below uses single-tree search to find approximate
+neighbors with epsilon set to 0.1.
+
+ at code
+mlpack_kfn -q queries.csv -r refs.csv -v -k 3 -e 0.1 -n n.csv -d d.csv -a single_tree
+[INFO ] Loading 'refs.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Loaded reference data from 'refs.csv' (3x1000).
+[INFO ] Building reference tree...
+[INFO ] Tree built.
+[INFO ] Loading 'queries.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Loaded query data from 'queries.csv' (3x1000).
+[INFO ] Searching for 3 neighbors with single-tree kd-tree search...
+[INFO ] 13240 node combinations were scored.
+[INFO ] 15924 base cases were calculated.
+[INFO ] Search complete.
+[INFO ] Saving CSV data to 'n.csv'.
+[INFO ] Saving CSV data to 'd.csv'.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: single_tree
+[INFO ]   distances_file: d.csv
+[INFO ]   epsilon: 0.1
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: ""
+[INFO ]   k: 3
+[INFO ]   leaf_size: 20
+[INFO ]   naive: false
+[INFO ]   neighbors_file: n.csv
+[INFO ]   output_model_file: ""
+[INFO ]   percentage: 1
+[INFO ]   query_file: queries.csv
+[INFO ]   random_basis: false
+[INFO ]   reference_file: refs.csv
+[INFO ]   seed: 0
+[INFO ]   single_mode: false
+[INFO ]   tree_type: kd
+[INFO ]   true_distances_file: ""
+[INFO ]   true_neighbors_file: ""
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   computing_neighbors: 0.000850s
+[INFO ]   loading_data: 0.007858s
+[INFO ]   saving_data: 0.003445s
+[INFO ]   total_time: 0.013084s
+[INFO ]   tree_building: 0.000250s
+ at endcode
+
+ at subsection cli_ex8_akfntut Saving a model for later use
+
+The \c mlpack_approx_kfn and \c mlpack_kfn programs both allow models to be
+saved and loaded for future use.  The \c --output_model_file (\c -M) option
+allows specifying where to save a model, and the \c --input_model_file (\c -m)
+option allows a model to be loaded instead of trained.  So, if you specify
+\c --input_model_file then you do not need to specify \c --reference_file
+(\c -r), \c --num_projections (\c -p), or \c --num_tables (\c -t).
+
+The example below saves a model with 10 projections and 5 tables.  Note that
+neither \c --query_file (\c -q) nor \c -k are specified; this run only builds
+the model and saves it to \c model.bin.
+
+ at code
+$ mlpack_approx_kfn -r refs.csv -t 5 -p 10 -v -M model.bin
+[INFO ] Loading 'refs.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Building DrusillaSelect model...
+[INFO ] Model built.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: ds
+[INFO ]   calculate_error: false
+[INFO ]   distances_file: ""
+[INFO ]   exact_distances_file: ""
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: ""
+[INFO ]   k: 0
+[INFO ]   neighbors_file: ""
+[INFO ]   num_projections: 10
+[INFO ]   num_tables: 5
+[INFO ]   output_model_file: model.bin
+[INFO ]   query_file: ""
+[INFO ]   reference_file: refs.csv
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   drusilla_select_construct: 0.000321s
+[INFO ]   loading_data: 0.004700s
+[INFO ]   total_time: 0.007320s
+ at endcode
+
+Now, with the model saved, we can run approximate furthest neighbor search on a
+query set using the saved model:
+
+ at code
+$ mlpack_approx_kfn -m model.bin -q queries.csv -k 3 -d d.csv -n n.csv -v
+[INFO ] Loading 'queries.csv' as CSV data.  Size is 3 x 1000.
+[INFO ] Searching for 3 furthest neighbors with DrusillaSelect...
+[INFO ] Search complete.
+[INFO ] Saving CSV data to 'n.csv'.
+[INFO ] Saving CSV data to 'd.csv'.
+[INFO ] 
+[INFO ] Execution parameters:
+[INFO ]   algorithm: ds
+[INFO ]   calculate_error: false
+[INFO ]   distances_file: d.csv
+[INFO ]   exact_distances_file: ""
+[INFO ]   help: false
+[INFO ]   info: ""
+[INFO ]   input_model_file: model.bin
+[INFO ]   k: 3
+[INFO ]   neighbors_file: n.csv
+[INFO ]   num_projections: 5
+[INFO ]   num_tables: 5
+[INFO ]   output_model_file: ""
+[INFO ]   query_file: queries.csv
+[INFO ]   reference_file: ""
+[INFO ]   verbose: true
+[INFO ]   version: false
+[INFO ] 
+[INFO ] Program timers:
+[INFO ]   drusilla_select_search: 0.000878s
+[INFO ]   loading_data: 0.004599s
+[INFO ]   saving_data: 0.003006s
+[INFO ]   total_time: 0.009234s
+ at endcode
+
+These options work in the same way for both the \c mlpack_approx_kfn and
+\c mlpack_kfn programs.
+
+ at subsection cli_final_akfntut Final command-line program notes
+
+Both the \c mlpack_kfn and \c mlpack_approx_kfn programs contain numerous
+options not fully documented in these short examples.  You can run each program
+with the \c --help (\c -h) option for more information.
+
+ at section cpp_ds_akfntut DrusillaSelect C++ class
+
+\b mlpack provides a simple \c DrusillaSelect C++ class that can be used inside
+of C++ programs to perform approximate furthest neighbor search.  The class has
+only one template parameter---\c MatType---which specifies the type of matrix to
+be use.  That means the class can be used with either dense data (of type
+\c arma::mat) or sparse data (of type \c arma::sp_mat).
+
+The following examples show simple usage of this class.
+
+ at subsection cpp_ex1_ds_akfntut Approximate furthest neighbors with defaults
+
+The code below builds a \c DrusillaSelect model with default options on the
+matrix \c dataset, then queries for the approximate furthest neighbor of every
+point in the \c queries matrix.
+
+ at code
+#include <mlpack/methods/approx_kfn/drusilla_select.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+// The query set.
+extern arma::mat queries;
+
+// Construct the model with defaults.
+DrusillaSelect<> ds(dataset);
+
+// Query the model, putting output into the following two matrices.
+arma::mat distances;
+arma::Mat<size_t> neighbors;
+ds.Search(queries, 1, neighbors, distances);
+ at endcode
+
+At the end of this code, both the \c distances and \c neighbors matrices will
+have number of columns equal to the number of columns in the \c queries matrix.
+So, each column of the \c distances and \c neighbors matrices are the distances
+or neighbors of the corresponding column in the \c queries matrix.
+
+ at subsection cpp_ex2_ds_akfntut Custom numbers of tables and projections
+
+The following example constructs a DrusillaSelect model with 10 tables and 5
+projections.  Once that is done it performs the same task as the previous
+example.
+
+ at code
+#include <mlpack/methods/approx_kfn/drusilla_select.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+// The query set.
+extern arma::mat queries;
+
+// Construct the model with custom parameters.
+DrusillaSelect<> ds(dataset, 10, 5);
+
+// Query the model, putting output into the following two matrices.
+arma::mat distances;
+arma::Mat<size_t> neighbors;
+ds.Search(queries, 1, neighbors, distances);
+ at endcode
+
+ at subsection cpp_ex3_ds_akfntut Accessing the candidate set
+
+The \c DrusillaSelect algorithm merely scans the reference set and extracts a
+number of points that will be queried in a brute-force fashion when the
+\c Search() method is called.  We can access this set with the \c CandidateSet()
+method.  The code below prints the fifth point of the candidate set.
+
+ at code
+#include <mlpack/methods/approx_kfn/drusilla_select.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+
+// Construct the model with custom parameters.
+DrusillaSelect<> ds(dataset, 10, 5);
+
+// Print the fifth point of the candidate set.
+std::cout << ds.CandidateSet().col(4).t();
+ at endcode
+
+ at subsection cpp_ex4_ds_akfntut Retraining on a new reference set
+
+It is possible to retrain a \c DrusillaSelect model with new parameters or with
+a new reference set.  This is functionally equivalent to creating a new model.
+The example code below creates a first \c DrusillaSelect model using 3 tables
+and 10 projections, and then retrains this with the same reference set using 10
+tables and 3 projections.
+
+ at code
+#include <mlpack/methods/approx_kfn/drusilla_select.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+
+// Construct the model with initial parameters.
+DrusillaSelect<> ds(dataset, 3, 10);
+
+// Now retrain with different parameters.
+ds.Train(dataset, 10, 3);
+ at endcode
+
+ at subsection cpp_ex5_ds_akfntut Running on sparse data
+
+We can set the template parameter for \c DrusillaSelect to \c arma::sp_mat in
+order to perform furthest neighbor search on sparse data.  This code below
+creates a \c DrusillaSelect model using 4 tables and 6 projections with sparse
+input data, then searches for 3 approximate furthest neighbors.
+
+ at code
+#include <mlpack/methods/approx_kfn/drusilla_select.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::sp_mat dataset;
+// The query dataset.
+extern arma::sp_mat querySet;
+
+// Construct the model on sparse data.
+DrusillaSelect<arma::sp_mat> ds(dataset, 4, 6);
+
+// Search on query data.
+arma::Mat<size_t> neighbors;
+arma::mat distances;
+ds.Search(querySet, 3, neighbors, distances);
+ at endcode
+
+ at section cpp_qdafn_akfntut QDAFN C++ class
+
+\b mlpack also provides a standalone simple \c QDAFN class for furthest neighbor
+search.  The API for this class is virtually identical to the \c DrusillaSelect
+class, and also has one template parameter to specify the type of matrix to be
+used (dense or sparse or other).
+
+The following subsections demonstrate usage of the \c QDAFN class in the same
+way as the previous section's examples for \c DrusillaSelect.
+
+ at subsection cpp_ex1_qdafn_akfntut Approximate furthest neighbors with defaults
+
+The code below builds a \c QDAFN model with default options on the
+matrix \c dataset, then queries for the approximate furthest neighbor of every
+point in the \c queries matrix.
+
+ at code
+#include <mlpack/methods/approx_kfn/qdafn.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+// The query set.
+extern arma::mat queries;
+
+// Construct the model with defaults.
+QDAFN<> qd(dataset);
+
+// Query the model, putting output into the following two matrices.
+arma::mat distances;
+arma::Mat<size_t> neighbors;
+qd.Search(queries, 1, neighbors, distances);
+ at endcode
+
+At the end of this code, both the \c distances and \c neighbors matrices will
+have number of columns equal to the number of columns in the \c queries matrix.
+So, each column of the \c distances and \c neighbors matrices are the distances
+or neighbors of the corresponding column in the \c queries matrix.
+
+ at subsection cpp_ex2_qdafn_akfntut Custom numbers of tables and projections
+
+The following example constructs a QDAFN model with 15 tables and 30
+projections.  Once that is done it performs the same task as the previous
+example.
+
+ at code
+#include <mlpack/methods/approx_kfn/qdafn.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+// The query set.
+extern arma::mat queries;
+
+// Construct the model with custom parameters.
+QDAFN<> qdafn(dataset, 15, 30);
+
+// Query the model, putting output into the following two matrices.
+arma::mat distances;
+arma::Mat<size_t> neighbors;
+qdafn.Search(queries, 1, neighbors, distances);
+ at endcode
+
+ at subsection cpp_ex3_qdafn_akfntut Accessing the candidate set
+
+The \c QDAFN algorithm scans the reference set, extracting points that have been
+projected onto random directions.  Each random direction corresponds to a single
+table.  The \c QDAFN class stores these points as a vector of matrices, which
+can be accessed with the \c CandidateSet() method.  The code below prints the
+fifth point of the candidate set of the third table.
+
+ at code
+#include <mlpack/methods/approx_kfn/qdafn.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+
+// Construct the model with custom parameters.
+QDAFN<> qdafn(dataset, 10, 5);
+
+// Print the fifth point of the candidate set.
+std::cout << ds.CandidateSet(2).col(4).t();
+ at endcode
+
+ at subsection cpp_ex4_qdafn_akfntut Retraining on a new reference set
+
+It is possible to retrain a \c QDAFN model with new parameters or with
+a new reference set.  This is functionally equivalent to creating a new model.
+The example code below creates a first \c QDAFN model using 10 tables
+and 40 projections, and then retrains this with the same reference set using 15
+tables and 25 projections.
+
+ at code
+#include <mlpack/methods/approx_kfn/qdafn.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+
+// Construct the model with initial parameters.
+QDAFN<> qdafn(dataset, 3, 10);
+
+// Now retrain with different parameters.
+qdafn.Train(dataset, 10, 3);
+ at endcode
+
+ at subsection cpp_ex5_qdafn_akfntut Running on sparse data
+
+We can set the template parameter for \c QDAFN to \c arma::sp_mat in
+order to perform furthest neighbor search on sparse data.  This code below
+creates a \c QDAFN model using 20 tables and 60 projections with sparse
+input data, then searches for 3 approximate furthest neighbors.
+
+ at code
+#include <mlpack/methods/approx_kfn/qdafn.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::sp_mat dataset;
+// The query dataset.
+extern arma::sp_mat querySet;
+
+// Construct the model on sparse data.
+QDAFN<arma::sp_mat> qdafn(dataset, 20, 60);
+
+// Search on query data.
+arma::Mat<size_t> neighbors;
+arma::mat distances;
+qdafn.Search(querySet, 3, neighbors, distances);
+ at endcode
+
+ at section cpp_ns_akfntut KFN C++ class
+
+The extensive \c NeighborSearch class also provides a way to search for
+approximate furthest neighbors using a different, tree-based technique.  For
+full documentation on this class, see the
+\ref nstutorial "NeighborSearch tutorial".  The \c KFN class is a convenient
+typedef of the \c NeighborSearch class that can be used to perform the furthest
+neighbors task with kd-trees.
+
+In the following subsections, the \c KFN class is used in short code examples.
+
+ at subsection cpp_ex1_ns_akfntut Simple furthest neighbors example
+
+The \c KFN class has construction semantics similar to \c DrusillaSelect and
+\c QDAFN.  The example below constructs a \c KFN object (which will build the
+tree on the reference set), but note that the third parameter to the constructor
+allows us to specify our desired level of approximation.  In this example we
+choose epsilon = 0.05.  Then, the code searches for 3 approximate furthest
+neighbors.
+
+ at code
+#include <mlpack/methods/neighbor_search/neighbor_search.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference dataset.
+extern arma::mat dataset;
+// The query set.
+extern arma::mat querySet;
+
+// Construct the object, performing the default dual-tree search with
+// approximation level epsilon = 0.05.
+KFN kfn(dataset, KFN::DUAL_TREE_MODE, 0.05);
+
+// Search for approximate furthest neighbors.
+arma::Mat<size_t> neighbors;
+arma::mat distances;
+kfn.Search(querySet, 3, neighbors, distances);
+ at endcode
+
+ at subsection cpp_ex2_ns_akfntut Retraining on a new reference set
+
+Like the \c QDAFN and \c DrusillaSelect classes, the \c KFN class is capable of
+retraining on a new reference set.  The code below demonstrates this.
+
+ at code
+#include <mlpack/methods/neighbor_search/neighbor_search.hpp>
+
+using namespace mlpack::neighbor;
+
+// The original reference set we train on.
+extern arma::mat dataset;
+// The new reference set we retrain on.
+extern arma::mat newDataset;
+
+// Construct the object with approximation level 0.1.
+KFN kfn(dataset, DUAL_TREE_MODE, 0.1);
+
+// Retrain on the new reference set.
+kfn.Train(newDataset);
+ at endcode
+
+ at subsection cpp_ex3_ns_akfntut Searching in single-tree mode
+
+The particular mode to be used in search can be specified in the constructor.
+In this example, we use single-tree search (as opposed to the default of
+dual-tree search).
+
+ at code
+#include <mlpack/methods/neighbor_search/neighbor_search.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference set.
+extern arma::mat dataset;
+// The query set.
+extern arma::mat querySet;
+
+// Construct the object with approximation level 0.25 and in single tree search
+// mode.
+KFN kfn(dataset, SINGLE_TREE_MODE, 0.25);
+
+// Search for 5 approximate furthest neighbors.
+arma::Mat<size_t> neighbors;
+arma::mat distances;
+kfn.Search(querySet, 5, neighbors, distances);
+ at endcode
+
+ at subsection cpp_ex4_ns_akfntut Searching in brute-force mode
+
+If desired, brute-force search ("naive search") can be used to find the furthest
+neighbors; however, the result will not be approximate---it will be exact (since
+every possibility will be considered).  The code below performs exact furthest
+neighbor search by using the \c KFN class in brute-force mode.
+
+ at code
+#include <mlpack/methods/neighbor_search/neighbor_search.hpp>
+
+using namespace mlpack::neighbor;
+
+// The reference set.
+extern arma::mat dataset;
+// The query set.
+extern arma::mat querySet;
+
+// Construct the object in brute-force mode.  We can leave the approximation
+// parameter to its default (0) since brute-force will provide exact results.
+KFN kfn(dataset, NAIVE_MODE);
+
+// Perform the search for 2 furthest neighbors.
+arma::Mat<size_t> neighbors;
+arma::mat distances;
+kfn.Search(querySet, 2, neighbors, distances);
+ at endcode
+
+ at section further_doc_akfntut Further documentation
+
+For further documentation on the approximate furthest neighbor facilities
+offered by \b mlpack, consult the following documentation:
+
+ - \ref nstutorial
+ - \ref mlpack::neighbor::QDAFN "QDAFN class documentation"
+ - \ref mlpack::neighbor::DrusillaSelect "DrusillaSelect class documentation"
+ - \ref mlpack::neighbor::NeighborSearch "NeighborSearch class documentation"
+
+*/

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/mlpack.git



More information about the debian-science-commits mailing list