[DRE-commits] [ruby-classifier] 01/04: Imported Upstream version 1.3.4

Youhei SASAKI uwabami-guest at moszumanska.debian.org
Sun Mar 23 01:00:52 UTC 2014


This is an automated email from the git hooks/post-receive script.

uwabami-guest pushed a commit to branch master
in repository ruby-classifier.

commit 133841871d63ba50dc6db7aade79dfd467a77d67
Author: Youhei SASAKI <uwabami at gfd-dennou.org>
Date:   Wed Feb 12 02:44:46 2014 +0900

    Imported Upstream version 1.3.4
---
 Gemfile                                |   5 ++
 Gemfile.lock                           |  26 +++++++++++
 README => README.markdown              |  65 +++++++++++++++-----------
 Rakefile                               |  14 +-----
 checksums.yaml.gz                      | Bin 0 -> 268 bytes
 lib/classifier/bayes.rb                |   7 +++
 lib/classifier/extensions/vector.rb    |   2 +-
 lib/classifier/extensions/word_hash.rb |  31 ++++++++----
 metadata.yml                           |  83 ++++++++++++++++-----------------
 test/extensions/word_hash_test.rb      |  21 +++++++++
 10 files changed, 159 insertions(+), 95 deletions(-)

diff --git a/Gemfile b/Gemfile
new file mode 100644
index 0000000..05a1d05
--- /dev/null
+++ b/Gemfile
@@ -0,0 +1,5 @@
+source 'https://rubygems.org'
+gem 'rake'
+gem 'rspec', :require => 'spec'
+gem 'rdoc' 
+gem 'fast-stemmer'
diff --git a/Gemfile.lock b/Gemfile.lock
new file mode 100644
index 0000000..810db3d
--- /dev/null
+++ b/Gemfile.lock
@@ -0,0 +1,26 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    diff-lcs (1.2.5)
+    fast-stemmer (1.0.2)
+    json (1.8.1)
+    rake (10.1.1)
+    rdoc (4.1.0)
+      json (~> 1.4)
+    rspec (2.14.1)
+      rspec-core (~> 2.14.0)
+      rspec-expectations (~> 2.14.0)
+      rspec-mocks (~> 2.14.0)
+    rspec-core (2.14.7)
+    rspec-expectations (2.14.4)
+      diff-lcs (>= 1.1.3, < 2.0)
+    rspec-mocks (2.14.4)
+
+PLATFORMS
+  ruby
+
+DEPENDENCIES
+  fast-stemmer
+  rake
+  rdoc
+  rspec
diff --git a/README b/README.markdown
similarity index 69%
rename from README
rename to README.markdown
index fbf7b9c..6304bb0 100644
--- a/README
+++ b/README.markdown
@@ -1,16 +1,18 @@
-== Welcome to Classifier
+## Welcome to Classifier
 
 Classifier is a general module to allow Bayesian and other types of classifications.
 
-== Download
+## Download
 
-* http://rubyforge.org/projects/classifier
+* https://github.com/cardmagic/classifier
 * gem install classifier
-* svn co http://rufy.com/svn/classifier/trunk
+* git clone https://github.com/cardmagic/classifier.git
 
-== Dependencies
-If you install Classifier from source, you'll need to install Martin Porter's stemmer algorithm with RubyGems as follows:
-  gem install stemmer
+## Dependencies
+
+If you install Classifier from source, you'll need to install Roman Shterenzon's fast-stemmer gem with RubyGems as follows:
+
+    gem install fast-stemmer
 
 If you would like to speed up LSI classification by at least 10x, please install the following libraries:
 GNU GSL:: http://www.gnu.org/software/gsl
@@ -18,10 +20,12 @@ rb-gsl:: http://rb-gsl.rubyforge.org
 
 Notice that LSI will work without these libraries, but as soon as they are installed, Classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.
 
-== Bayes
+## Bayes
+
 A Bayesian classifier by Lucas Carlson. Bayesian Classifiers are accurate, fast, and have modest memory requirements.
 
-=== Usage
+### Usage
+
     require 'classifier'
     b = Classifier::Bayes.new 'Interesting', 'Uninteresting'
     b.train_interesting "here are some good words. I hope you love them"
@@ -39,50 +43,55 @@ A Bayesian classifier by Lucas Carlson. Bayesian Classifiers are accurate, fast,
 
 Using Madeleine, your application can persist the learned data over time.
 
-=== Bayesian Classification
+### Bayesian Classification
 
 * http://www.process.com/precisemail/bayesian_filtering.htm
 * http://en.wikipedia.org/wiki/Bayesian_filtering
 * http://www.paulgraham.com/spam.html
 
-== LSI
+## LSI
+
 A Latent Semantic Indexer by David Fayram. Latent Semantic Indexing engines
 are not as fast or as small as Bayesian classifiers, but are more flexible, providing 
 fast search and clustering detection as well as semantic analysis of the text that 
 theoretically simulates human learning.
 
-=== Usage
-  require 'classifier'
-  lsi = Classifier::LSI.new
-  strings = [ ["This text deals with dogs. Dogs.", :dog],
+### Usage
+
+    require 'classifier'
+    lsi = Classifier::LSI.new
+    strings = [ ["This text deals with dogs. Dogs.", :dog],
               ["This text involves dogs too. Dogs! ", :dog],
               ["This text revolves around cats. Cats.", :cat],
               ["This text also involves cats. Cats!", :cat],
               ["This text involves birds. Birds.",:bird ]]
-  strings.each {|x| lsi.add_item x.first, x.last}
+    strings.each {|x| lsi.add_item x.first, x.last}
   
-  lsi.search("dog", 3)
-  # returns => ["This text deals with dogs. Dogs.", "This text involves dogs too. Dogs! ", 
-  #             "This text also involves cats. Cats!"]
+    lsi.search("dog", 3)
+    # returns => ["This text deals with dogs. Dogs.", "This text involves dogs too. Dogs! ", 
+    #             "This text also involves cats. Cats!"]
 
-  lsi.find_related(strings[2], 2)
-  # returns => ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]
+    lsi.find_related(strings[2], 2)
+    # returns => ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]
   
-  lsi.classify "This text is also about dogs!"
-  # returns => :dog
+    lsi.classify "This text is also about dogs!"
+    # returns => :dog
   
 Please see the Classifier::LSI documentation for more information. It is possible to index, search and classify
 with more than just simple strings. 
 
-=== Latent Semantic Indexing
+### Latent Semantic Indexing
+
 * http://www.c2.com/cgi/wiki?LatentSemanticIndexing
 * http://www.chadfowler.com/index.cgi/Computing/LatentSemanticIndexing.rdoc
 * http://en.wikipedia.org/wiki/Latent_semantic_analysis
 
-== Authors    
-* Lucas Carlson  (mailto:lucas at rufy.com)
-* David Fayram II (mailto:dfayram at gmail.com)
-* Cameron McBride (mailto:cameron.mcbride at gmail.com)
+## Authors    
+
+* Lucas Carlson  (lucas at rufy.com)
+* David Fayram II (dfayram at gmail.com)
+* Cameron McBride (cameron.mcbride at gmail.com)
+* Ivan Acosta-Rubio (ivan at softwarecriollo.com)
 
 This library is released under the terms of the GNU LGPL. See LICENSE for more details.
 
diff --git a/Rakefile b/Rakefile
index 65018d7..feaa506 100644
--- a/Rakefile
+++ b/Rakefile
@@ -1,16 +1,9 @@
 require 'rubygems'
 require 'rake'
 require 'rake/testtask'
-require 'rake/rdoctask'
-require 'rake/gempackagetask'
+require 'rdoc/task'
 require 'rake/contrib/rubyforgepublisher'
 
-PKG_VERSION = "1.3.3"
-
-PKG_FILES = FileList[
-    "lib/**/*", "bin/*", "test/**/*", "[A-Z]*", "Rakefile", "html/**/*"
-]
-
 desc "Default Task"
 task :default => [ :test ]
 
@@ -75,11 +68,6 @@ spec = Gem::Specification.new do |s|
   s.homepage = "http://classifier.rufy.com/"
 end
 
-Rake::GemPackageTask.new(spec) do |pkg|
-  pkg.need_zip = true
-  pkg.need_tar = true
-end
-
 desc "Report code statistics (KLOCs, etc) from the application"
 task :stats do
   require 'code_statistics'
diff --git a/checksums.yaml.gz b/checksums.yaml.gz
new file mode 100644
index 0000000..43d6112
Binary files /dev/null and b/checksums.yaml.gz differ
diff --git a/lib/classifier/bayes.rb b/lib/classifier/bayes.rb
index 26191e2..39a25b2 100644
--- a/lib/classifier/bayes.rb
+++ b/lib/classifier/bayes.rb
@@ -12,6 +12,7 @@ class Bayes
 		@categories = Hash.new
 		categories.each { |category| @categories[category.prepare_category_name] = Hash.new }
 		@total_words = 0
+                @category_counts = Hash.new(0)
 	end
 
 	#
@@ -23,6 +24,7 @@ class Bayes
 	#     b.train "The other", "The other text"
 	def train(category, text)
 		category = category.prepare_category_name
+                @category_counts[category] += 1
 		text.word_hash.each do |word, count|
 			@categories[category][word]     ||=     0
 			@categories[category][word]      +=     count
@@ -40,6 +42,7 @@ class Bayes
 	#     b.untrain :this, "This text"
 	def untrain(category, text)
 		category = category.prepare_category_name
+                @category_counts[category] -= 1
 		text.word_hash.each do |word, count|
 			if @total_words >= 0
 				orig = @categories[category][word]
@@ -61,6 +64,7 @@ class Bayes
 	# The largest of these scores (the one closest to 0) is the one picked out by #classify
 	def classifications(text)
 		score = Hash.new
+                training_count = @category_counts.values.inject { |x,y| x+y }.to_f
 		@categories.each do |category, category_words|
 			score[category.to_s] = 0
 			total = category_words.values.inject(0) {|sum, element| sum+element}
@@ -68,6 +72,9 @@ class Bayes
 				s = category_words.has_key?(word) ? category_words[word] : 0.1
 				score[category.to_s] += Math.log(s/total.to_f)
 			end
+                        # now add prior probability for the category
+                        s = @category_counts.has_key?(category) ? @category_counts[category] : 0.1
+                        score[category.to_s] += Math.log(s / training_count)
 		end
 		return score
 	end
diff --git a/lib/classifier/extensions/vector.rb b/lib/classifier/extensions/vector.rb
index 271366a..7f8a61d 100644
--- a/lib/classifier/extensions/vector.rb
+++ b/lib/classifier/extensions/vector.rb
@@ -13,7 +13,7 @@ class Array
     if block_given?
       map(&block).sum
     else
-      inject { |sum, element| sum + element }.to_f
+      reduce(:+)
     end
   end
 end
diff --git a/lib/classifier/extensions/word_hash.rb b/lib/classifier/extensions/word_hash.rb
index cef4eb6..928387d 100644
--- a/lib/classifier/extensions/word_hash.rb
+++ b/lib/classifier/extensions/word_hash.rb
@@ -2,6 +2,8 @@
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License::   LGPL
 
+require "set"
+
 # These are extensions to the String class to provide convenience 
 # methods for the Classifier package.
 class String
@@ -17,7 +19,9 @@ class String
   # Return a Hash of strings => ints. Each word in the string is stemmed,
   # interned, and indexes to its frequency in the document.  
 	def word_hash
-		word_hash_for_words(gsub(/[^\w\s]/,"").split + gsub(/[\w]/," ").split)
+		word_hash = clean_word_hash()
+		symbol_hash = word_hash_for_symbols(gsub(/[\w]/," ").split)
+		return word_hash.merge(symbol_hash)
 	end
 
 	# Return a word hash without extra punctuation or short symbols, just stemmed words
@@ -28,19 +32,26 @@ class String
 	private
 	
 	def word_hash_for_words(words)
-		d = Hash.new
+		d = Hash.new(0)
 		words.each do |word|
-			word.downcase! if word =~ /[\w]+/
-			key = word.stem.intern
-			if word =~ /[^\w]/ || ! CORPUS_SKIP_WORDS.include?(word) && word.length > 2
-				d[key] ||= 0
-				d[key] += 1
+			word.downcase!
+			if ! CORPUS_SKIP_WORDS.include?(word) && word.length > 2
+				d[word.stem.intern] += 1
 			end
 		end
 		return d
 	end
+
+
+	def word_hash_for_symbols(words)
+		d = Hash.new(0)
+		words.each do |word|
+			d[word.intern] += 1
+		end
+		return d
+	end
 	
-	CORPUS_SKIP_WORDS = [
+	CORPUS_SKIP_WORDS = Set.new([
       "a",
       "again",
       "all",
@@ -121,5 +132,5 @@ class String
       "yes",
       "you",
       "youll",
-      ]
-end
\ No newline at end of file
+      ])
+end
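
The word_hash.rb changes above split counting into a word pass and a punctuation pass, use Hash.new(0) so counters need no explicit initialisation, and store the skip list in a Set for constant-time membership checks. A rough sketch of that structure, under stated assumptions: SKIP_WORDS is a shortened hypothetical list, the helper names are invented for illustration, and stemming is omitted so the code runs without fast-stemmer.

```ruby
require "set"

# Hypothetical, shortened skip list; the gem's CORPUS_SKIP_WORDS is much longer.
SKIP_WORDS = Set.new(%w[a again all the you])

# Count words, ignoring skip words and words of two characters or fewer.
# Stemming (word.stem in the gem) is left out of this sketch.
def count_words(text)
  counts = Hash.new(0)
  text.gsub(/[^\w\s]/, "").split.each do |word|
    word = word.downcase
    next if SKIP_WORDS.include?(word) || word.length <= 2
    counts[word.to_sym] += 1
  end
  counts
end

# Count the runs of punctuation left once word characters are blanked out.
def count_symbols(text)
  counts = Hash.new(0)
  text.gsub(/\w/, " ").split.each { |sym| counts[sym.to_sym] += 1 }
  counts
end

# Merge both passes into one frequency hash.
def word_hash(text)
  count_words(text).merge(count_symbols(text))
end
```

For example, word_hash("Dogs! Dogs chase the cat.") counts :dogs twice, :chase and :cat once each, drops "the", and records one "!" and one "." as symbol keys.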
diff --git a/metadata.yml b/metadata.yml
index d47abf8..aba63e7 100644
--- a/metadata.yml
+++ b/metadata.yml
@@ -1,82 +1,79 @@
---- !ruby/object:Gem::Specification 
+--- !ruby/object:Gem::Specification
 name: classifier
-version: !ruby/object:Gem::Version 
-  version: 1.3.3
+version: !ruby/object:Gem::Version
+  version: 1.3.4
 platform: ruby
-authors: 
+authors:
 - Lucas Carlson
 autorequire: classifier
 bindir: bin
 cert_chain: []
-
-date: 2010-07-06 00:00:00 -07:00
-default_executable: 
-dependencies: 
-- !ruby/object:Gem::Dependency 
+date: 2013-12-31 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
   name: fast-stemmer
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: 1.0.0
   type: :runtime
-  version_requirement: 
-  version_requirements: !ruby/object:Gem::Requirement 
-    requirements: 
-    - - ">="
-      - !ruby/object:Gem::Version 
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
         version: 1.0.0
-    version: 
-description: "   A general classifier module to allow Bayesian and other types of classifications.\n"
+description: |2
+     A general classifier module to allow Bayesian and other types of classifications.
 email: lucas at rufy.com
 executables: []
-
 extensions: []
-
 extra_rdoc_files: []
-
-files: 
+files:
+- lib/classifier.rb
 - lib/classifier/bayes.rb
 - lib/classifier/extensions/string.rb
 - lib/classifier/extensions/vector.rb
 - lib/classifier/extensions/vector_serialize.rb
 - lib/classifier/extensions/word_hash.rb
+- lib/classifier/lsi.rb
 - lib/classifier/lsi/content_node.rb
 - lib/classifier/lsi/summary.rb
 - lib/classifier/lsi/word_list.rb
-- lib/classifier/lsi.rb
-- lib/classifier.rb
 - bin/bayes.rb
 - bin/summarize.rb
 - test/bayes/bayesian_test.rb
 - test/extensions/word_hash_test.rb
 - test/lsi/lsi_test.rb
 - test/test_helper.rb
+- Gemfile
+- Gemfile.lock
 - LICENSE
+- README.markdown
 - Rakefile
-- README
-has_rdoc: true
 homepage: http://classifier.rufy.com/
 licenses: []
-
+metadata: {}
 post_install_message: 
 rdoc_options: []
-
-require_paths: 
+require_paths:
 - lib
-required_ruby_version: !ruby/object:Gem::Requirement 
-  requirements: 
-  - - ">="
-    - !ruby/object:Gem::Version 
-      version: "0"
-  version: 
-required_rubygems_version: !ruby/object:Gem::Requirement 
-  requirements: 
-  - - ">="
-    - !ruby/object:Gem::Version 
-      version: "0"
-  version: 
-requirements: 
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements:
 - A porter-stemmer module to split word stems.
 rubyforge_project: 
-rubygems_version: 1.3.5
+rubygems_version: 2.0.3
 signing_key: 
-specification_version: 3
+specification_version: 4
 summary: A general classifier module to allow Bayesian and other types of classifications.
 test_files: []
-
diff --git a/test/extensions/word_hash_test.rb b/test/extensions/word_hash_test.rb
index a3bcf59..6d8feed 100644
--- a/test/extensions/word_hash_test.rb
+++ b/test/extensions/word_hash_test.rb
@@ -12,3 +12,24 @@ class StringExtensionsTest < Test::Unit::TestCase
 	end
 
 end
+
+
+class ArrayExtensionsTest < Test::Unit::TestCase
+
+  def test_plays_nicely_with_any_array
+    assert_equal [Array].sum, Array
+  end
+
+  def test_monkey_path_array_sum
+    assert_equal [1,2,3].sum, 6
+  end
+
+  def test_summing_an_empty_array
+    assert_equal [nil].sum, 0
+  end
+
+  def test_summing_an_empty_array
+    assert_equal Array[].sum, 0
+  end
+
+end
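
The new tests exercise the vector.rb change from inject { ... }.to_f to reduce(:+). Two details are worth noting. Because both of the last two tests are named test_summing_an_empty_array, Ruby keeps only the second definition, so the [nil] case is never executed. And reduce(:+) on a single-element array returns that element unchanged, which is why [Array].sum can equal Array. The sketch below illustrates this with a hypothetical helper, sum_elements, that does not reopen Array; returning the identity for empty input is an assumption of the sketch, since that part of the method is not visible in the diff.

```ruby
# Hypothetical helper for illustration; it does not reopen Array, so it
# cannot clash with Ruby's built-in Array#sum (available since 2.4).
def sum_elements(items, identity = 0)
  return identity if items.empty?   # assumption: empty input yields the identity
  items.reduce(:+)                  # no .to_f, so the element type is preserved
end
```

Dropping .to_f means integers stay integers and non-numeric elements are no longer coerced: sum_elements([1, 2, 3]) is the Integer 6, sum_elements([Array]) is Array itself, and sum_elements([nil]) is nil rather than 0.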

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/pkg-ruby-extras/ruby-classifier.git


