[SOLVED] Best way to store words count?

Architector_4 · December 30, 2018, 10:31pm

So I’m making a thing that counts how many times each word is present in a given String, but I’m not sure what would be the most efficient way to store the values.

Right now, the first way to store them that came to my mind is to make a new JSONObject words, and then, for every thisWord that I encounter, do words.setInt(thisWord, words.getInt(thisWord)+1 ); to effectively end up with a JSONObject that I can iterate upon to get how many times each word was used.

But, I’m wondering, would a Table with 1 row and column names corresponding to words be faster than a JSONObject?

Should I just make an ArrayList<Word> where Word is a custom class containing the word itself and its count, and then, when I want to add 1 to some word, iterate over all items in that ArrayList until I find that word and add 1 to its count?

Any thoughts?

kll · December 31, 2018, 12:54am

try that example:

IntDict inventory;
String[] lines, words;
String find = "Lorem";
boolean gotit = false;
long startT, stopT, dT;

void setup() {
  size(200, 200);
  inventory = new IntDict();
  lines = loadStrings("http://websitetips.com/articles/copy/lorem/ipsum.txt");
  for ( String l : lines ) {
    words = split(l, ' ');
    for ( String word : words )  inventory.add(word, 1);
  }
  println(inventory);
  println("we have "+inventory.size()+" entries");
  startT = millis();
  inventory.sortKeys();                              // sort alphabetically
  gotit = inventory.hasKey(find);                    // look for search string
  stopT = millis();
  dT = stopT - startT;
  if ( gotit ) {
    println("Yes, we found _" + find + "_: " + inventory.get(find) + " times");
  } else {
    println("Sorry, no _" + find + "_");
  }
  println(" and needed for sort and search " + dT + " msec" );
}

results:

IntDict size=131 { "Lorem": 19, "ipsum": 19, "dolor": 22, "sit": 19, "amet,": 10, "consetetur": 8, "sadipscing": 9, "elitr,": 9, "sed": 19, "diam": 19, "nonumy": 9, "eirmod": 9, "tempor": 10, "invidunt": 9, "ut": 13, "labore": 9, "et": 40, "dolore": 16, "magna": 11, "aliquyam": 9, "erat,": 8, "voluptua.": 8, "At": 9, "vero": 11, "eos": 9, "accusam": 9, "justo": 9, "duo": 9, "dolores": 9, "ea": 11, "rebum.": 9, "Stet": 9, "clita": 9, "kasd": 9, "gubergren,": 9, "no": 9, "sea": 9, "takimata": 9, "sanctus": 9, "est": 9, "amet.": 9, "": 8, "Duis": 3, "autem": 3, "vel": 6, "eum": 3, "iriure": 3, "in": 6, "hendrerit": 3, "vulputate": 3, "velit": 3, "esse": 3, "molestie": 3, "consequat,": 3, "illum": 3, "eu": 3, "feugiat": 3, "nulla": 5, "facilisis": 2, "at": 2, "eros": 2, "accumsan": 2, "iusto": 2, "odio": 2, "dignissim": 2, "qui": 2, "blandit": 2, "praesent": 2, "luptatum": 2, "zzril": 2, "delenit": 2, "augue": 2, "duis": 2, "te": 2, "feugait": 2, "facilisi.": 2, "consectetuer": 2, "adipiscing": 2, "elit,": 2, "nonummy": 2, "nibh": 2, "euismod": 2, "tincidunt": 2, "laoreet": 2, "aliquam": 2, "erat": 2, "volutpat.": 2, "Ut": 2, "wisi": 2, "enim": 2, "ad": 2, "minim": 2, "veniam,": 2, "quis": 2, "nostrud": 2, "exerci": 2, "tation": 2, "ullamcorper": 2, "suscipit": 2, "lobortis": 2, "nisl": 2, "aliquip": 2, "ex": 2, "commodo": 2, "consequat.": 2, "Nam": 1, "liber": 1, "cum": 1, "soluta": 1, "nobis": 1, "eleifend": 1, "option": 1, "congue": 1, "nihil": 1, "imperdiet": 1, "doming": 1, "id": 1, "quod": 1, "mazim": 1, "placerat": 1, "facer": 1, "possim": 1, "assum.": 1, "facilisis.": 1, "erat.": 1, "Consetetur": 1, "--------------------------------------------": 2, "Courtesy": 1, "of": 1, "WebsiteTips.com": 1, "http://websitetips.com/articles/copy/lorem/": 1 }
we have 131 entries
Yes, we found _Lorem_: 19 times
 and needed for sort and search 20 msec

not sure it’s what you need

Architector_4 · December 31, 2018, 1:02am

Oh. Didn’t know that IntDict is a thing. I guess it makes full sense to use that, considering it’s primary intended purpose is exactly what I need.

Thanks!

jeremydouglass · January 15, 2019, 12:13am

In addition to IntDict:

https://processing.org/reference/IntDict.html

you could use a Counter class based on HashMap or TreeMap. Or you could use a Guava and use a Guava MultiSet.

https://google.github.io/guava/releases/16.0/api/docs/com/google/common/collect/Multiset.html

I wrote a CounterMap class to do this for a library in progress. I ended up moving to a TreeMap-based implementation with multiple classes for the library, but here is an early stand-alone sketch I wrote that was based on a HashMap – just a single class. Here is how you would use it for your problem:

  CounterMap things = new CounterMap();
  String woodchuck = "how much wood would a wood chuck chuck if a wood chuck could chuck wood ?";
  println("\"", woodchuck, "\"");
  for (String s : split(woodchuck, ' ')) {
    things.count(s);
  }
  println(things.entrySet());

The main thing that is interesting about this approach is that it is not bound to type – it can count whatever – numbers, strings, characters, Objects.

main tab: CountingMethods

import java.util.Arrays;
import java.util.List;

CounterMap things;

void setup() {
  things = new CounterMap();

  println("\ncount(): Increments by 1 and returns the new value. If a key is not found, it adds it starting at 1.");
  println("count('a') -->", things.count('a'));  // 1
  println("count('a') -->", things.count('a'));  // 2

  println("\ngetCount(): Returns the value without changing it. Returns null if the key is not found.");
  println("getCount('a') -->", things.getCount('a'));  // 2
  println("getCount('z') -->", things.getCount('z'));  // null

  println("\nremove(): Removes a key, returning its value. Returns null if the key is not found.");
  println("remove('a') -->", things.remove('a'));  // 2
  println("remove('a') -->", things.remove('a'));  // null

  println("\nclear: Clears the map.\nclear()");
  things.clear();

  println("\nA CounterMap can generate counts for each unique words in string:");
  String woodchuck = "how much wood would a wood chuck chuck if a wood chuck could chuck wood ?";
  println("\"", woodchuck, "\"");
  for (String s : split(woodchuck, ' ')) {
    things.count(s);
  }
  println("entrySet()  -->", things.entrySet());
  println("getCount(\"wood\")  -->", things.getCount("wood"));  // 4
  println("getCount(\"a\")  -->", things.getCount("a"));  // 2

  println("\nA CounterMap can count almost anything -- characters, floats, integers, Strings, or Objects.");
  things.clear();
  println("count('m') -->", things.count('m'));  // 1
  println("count(1.4) -->", things.count(1.4));  // 1
  println("count(1.4) -->", things.count(1.4));  // 2
  println("count((int)1) -->", things.count((int)1));  // 1
  println("count((int)1) -->", things.count((int)1));  // 2
  println("count((int)1) -->", things.count((int)1));  // 3
  println("count(LEFT) -->", things.count(LEFT));  // 1
  println("count(LEFT) -->", things.count(LEFT));  // 2
  println("count(\"a string\") -->", things.count("a string"));  // 1
  println("entrySet() -->", things.entrySet());
  
  println("\nclear: Clears the map.\nclear()");
  things.clear();

  println("\nHowever, mixing types of data in one map is usually not a good idea. For example, the char 'a' and the String \"a\" are different.");
  things.clear();

  println("count(\"a\")  -->", things.count("a"));
  println("count(\"A\")  -->", things.count("A"));
  println("count(\"a \") -->", things.count("a "));
  println("count('a') -->", things.count('a'));
  println("count('a') -->", things.count('a'));
  println("entrySet() -->", things.entrySet());

  things.clear();

  println("\nsetCount(): Sets a count to a specific value. It returns the previous value, or null if the key did not exist.");
  println("setCount('x', 10) -->", things.setCount('x', 10));
  println("setCount('x', -10) -->", things.setCount('x', -10));
  println("setCount('x', 10) -->", things.setCount('x', 0));
  println("getCount('x')  -->", things.getCount('x'));

  println("\ncount() can also proceed by an arbitrary increment.");
  println("count('x', 5) -->", things.count('x', 5));  // 5
  println("count('x', 5) -->", things.count('x', 5));  // 10
  println("count('x', -1) -->", things.count('x', -1));  // 9
  println("count('x', -1) -->", things.count('x', -1));  // 8
  
  exit();
}

tab: CounterMap

import java.util.HashMap;
import java.util.Map;

class CounterMap<K> extends HashMap<K, Integer> {

  public Integer count (K key) {
    return count(key, 1);
  }

  public Integer count (K key, Integer increment) {
    Integer newValue = getOrDefault(key, 0) + increment;
    put(key, newValue);
    return newValue;
  }

  public Integer getCount (K key) {
    return get(key);
  }

  @Override
    public Integer remove (Object key) {
    return super.remove(key);
  }

  public Integer setCount (K key, Integer value) {
    return put(key, value);
  }
}

Architector_4 · January 15, 2019, 1:55pm

Thanks for the reply!
That class of yours does look like a powerful solution, however IntDict is a class I’d still prefer to use at it fits my need exactly.
In any case, its sortValuesReverse() method is magnitudes faster than what I used before for this purpose - which was my implementation of selection sort on a regular array done with constant juggling of Word objects in that array…

jeremydouglass · January 15, 2019, 4:58pm

Great – glad that IntDict is working for you!

Yes, sorting is exactly the reason I ended up moving towards TreeMap. I probably would have stuck with IntDict myself except I did that work for keystroke toggling and counting, so I wanted to be able to store, count, and sort either chars or a mix of chars and ints.

Topic		Replies	Views
Problem with splitTokens?! Coding Questions	8	382	March 17, 2021
Finding wordcount/frequency with a given text file Coding Questions homework	3	716	December 6, 2019
Int array and Int dict Beginners	6	423	September 19, 2020
Count identical strings in array for each unique entry Coding Questions	6	535	January 9, 2020
Question with duplication words Coding Questions homework	6	344	April 21, 2021

[SOLVED] Best way to store words count?

Related topics