[SOLVED] Best way to store words count?

So I’m making a thing that counts how many times each word is present in a given String, but I’m not sure what would be the most efficient way to store the values.

Right now, the first way to store them that came to my mind is to make a new JSONObject words, and then, for every thisWord that I encounter, do words.setInt(thisWord, words.getInt(thisWord)+1 ); to effectively end up with a JSONObject that I can iterate upon to get how many times each word was used.

But, I’m wondering, would a Table with 1 row and column names corresponding to words be faster than a JSONObject?

Should I just make an ArrayList<Word> where Word is a custom class containing the word itself and its count, and then, when I want to add 1 to some word, iterate over all items in that ArrayList until I find that word and add 1 to its count?

Any thoughts?

1 Like

try that example:

IntDict inventory;
String[] lines, words;
String find = "Lorem";
boolean gotit = false;
long startT, stopT, dT;

void setup() {
  size(200, 200);
  inventory = new IntDict();
  lines = loadStrings("http://websitetips.com/articles/copy/lorem/ipsum.txt");
  for ( String l : lines ) {
    words = split(l, ' ');
    for ( String word : words )  inventory.add(word, 1);
  }
  println(inventory);
  println("we have "+inventory.size()+" entries");
  startT = millis();
  inventory.sortKeys();                              // sort alphabetically
  gotit = inventory.hasKey(find);                    // look for search string
  stopT = millis();
  dT = stopT - startT;
  if ( gotit ) {
    println("Yes, we found _" + find + "_: " + inventory.get(find) + " times");
  } else {
    println("Sorry, no _" + find + "_");
  }
  println(" and needed for sort and search " + dT + " msec" );
}

results:

IntDict size=131 { "Lorem": 19, "ipsum": 19, "dolor": 22, "sit": 19, "amet,": 10, "consetetur": 8, "sadipscing": 9, "elitr,": 9, "sed": 19, "diam": 19, "nonumy": 9, "eirmod": 9, "tempor": 10, "invidunt": 9, "ut": 13, "labore": 9, "et": 40, "dolore": 16, "magna": 11, "aliquyam": 9, "erat,": 8, "voluptua.": 8, "At": 9, "vero": 11, "eos": 9, "accusam": 9, "justo": 9, "duo": 9, "dolores": 9, "ea": 11, "rebum.": 9, "Stet": 9, "clita": 9, "kasd": 9, "gubergren,": 9, "no": 9, "sea": 9, "takimata": 9, "sanctus": 9, "est": 9, "amet.": 9, "": 8, "Duis": 3, "autem": 3, "vel": 6, "eum": 3, "iriure": 3, "in": 6, "hendrerit": 3, "vulputate": 3, "velit": 3, "esse": 3, "molestie": 3, "consequat,": 3, "illum": 3, "eu": 3, "feugiat": 3, "nulla": 5, "facilisis": 2, "at": 2, "eros": 2, "accumsan": 2, "iusto": 2, "odio": 2, "dignissim": 2, "qui": 2, "blandit": 2, "praesent": 2, "luptatum": 2, "zzril": 2, "delenit": 2, "augue": 2, "duis": 2, "te": 2, "feugait": 2, "facilisi.": 2, "consectetuer": 2, "adipiscing": 2, "elit,": 2, "nonummy": 2, "nibh": 2, "euismod": 2, "tincidunt": 2, "laoreet": 2, "aliquam": 2, "erat": 2, "volutpat.": 2, "Ut": 2, "wisi": 2, "enim": 2, "ad": 2, "minim": 2, "veniam,": 2, "quis": 2, "nostrud": 2, "exerci": 2, "tation": 2, "ullamcorper": 2, "suscipit": 2, "lobortis": 2, "nisl": 2, "aliquip": 2, "ex": 2, "commodo": 2, "consequat.": 2, "Nam": 1, "liber": 1, "cum": 1, "soluta": 1, "nobis": 1, "eleifend": 1, "option": 1, "congue": 1, "nihil": 1, "imperdiet": 1, "doming": 1, "id": 1, "quod": 1, "mazim": 1, "placerat": 1, "facer": 1, "possim": 1, "assum.": 1, "facilisis.": 1, "erat.": 1, "Consetetur": 1, "--------------------------------------------": 2, "Courtesy": 1, "of": 1, "WebsiteTips.com": 1, "http://websitetips.com/articles/copy/lorem/": 1 }
we have 131 entries
Yes, we found _Lorem_: 19 times
 and needed for sort and search 20 msec

not sure it’s what you need

2 Likes

Oh. Didn’t know that IntDict is a thing. I guess it makes full sense to use that, considering it’s primary intended purpose is exactly what I need.

Thanks!

In addition to IntDict:

https://processing.org/reference/IntDict.html

you could use a Counter class based on HashMap or TreeMap. Or you could use a Guava and use a Guava MultiSet.

https://google.github.io/guava/releases/16.0/api/docs/com/google/common/collect/Multiset.html

I wrote a CounterMap class to do this for a library in progress. I ended up moving to a TreeMap-based implementation with multiple classes for the library, but here is an early stand-alone sketch I wrote that was based on a HashMap – just a single class. Here is how you would use it for your problem:

  CounterMap things = new CounterMap();
  String woodchuck = "how much wood would a wood chuck chuck if a wood chuck could chuck wood ?";
  println("\"", woodchuck, "\"");
  for (String s : split(woodchuck, ' ')) {
    things.count(s);
  }
  println(things.entrySet());

The main thing that is interesting about this approach is that it is not bound to type – it can count whatever – numbers, strings, characters, Objects.

main tab: CountingMethods

import java.util.Arrays;
import java.util.List;

CounterMap things;

void setup() {
  things = new CounterMap();

  println("\ncount(): Increments by 1 and returns the new value. If a key is not found, it adds it starting at 1.");
  println("count('a') -->", things.count('a'));  // 1
  println("count('a') -->", things.count('a'));  // 2

  println("\ngetCount(): Returns the value without changing it. Returns null if the key is not found.");
  println("getCount('a') -->", things.getCount('a'));  // 2
  println("getCount('z') -->", things.getCount('z'));  // null

  println("\nremove(): Removes a key, returning its value. Returns null if the key is not found.");
  println("remove('a') -->", things.remove('a'));  // 2
  println("remove('a') -->", things.remove('a'));  // null

  println("\nclear: Clears the map.\nclear()");
  things.clear();

  println("\nA CounterMap can generate counts for each unique words in string:");
  String woodchuck = "how much wood would a wood chuck chuck if a wood chuck could chuck wood ?";
  println("\"", woodchuck, "\"");
  for (String s : split(woodchuck, ' ')) {
    things.count(s);
  }
  println("entrySet()  -->", things.entrySet());
  println("getCount(\"wood\")  -->", things.getCount("wood"));  // 4
  println("getCount(\"a\")  -->", things.getCount("a"));  // 2

  println("\nA CounterMap can count almost anything -- characters, floats, integers, Strings, or Objects.");
  things.clear();
  println("count('m') -->", things.count('m'));  // 1
  println("count(1.4) -->", things.count(1.4));  // 1
  println("count(1.4) -->", things.count(1.4));  // 2
  println("count((int)1) -->", things.count((int)1));  // 1
  println("count((int)1) -->", things.count((int)1));  // 2
  println("count((int)1) -->", things.count((int)1));  // 3
  println("count(LEFT) -->", things.count(LEFT));  // 1
  println("count(LEFT) -->", things.count(LEFT));  // 2
  println("count(\"a string\") -->", things.count("a string"));  // 1
  println("entrySet() -->", things.entrySet());
  
  println("\nclear: Clears the map.\nclear()");
  things.clear();

  println("\nHowever, mixing types of data in one map is usually not a good idea. For example, the char 'a' and the String \"a\" are different.");
  things.clear();

  println("count(\"a\")  -->", things.count("a"));
  println("count(\"A\")  -->", things.count("A"));
  println("count(\"a \") -->", things.count("a "));
  println("count('a') -->", things.count('a'));
  println("count('a') -->", things.count('a'));
  println("entrySet() -->", things.entrySet());

  things.clear();

  println("\nsetCount(): Sets a count to a specific value. It returns the previous value, or null if the key did not exist.");
  println("setCount('x', 10) -->", things.setCount('x', 10));
  println("setCount('x', -10) -->", things.setCount('x', -10));
  println("setCount('x', 10) -->", things.setCount('x', 0));
  println("getCount('x')  -->", things.getCount('x'));

  println("\ncount() can also proceed by an arbitrary increment.");
  println("count('x', 5) -->", things.count('x', 5));  // 5
  println("count('x', 5) -->", things.count('x', 5));  // 10
  println("count('x', -1) -->", things.count('x', -1));  // 9
  println("count('x', -1) -->", things.count('x', -1));  // 8
  
  exit();
}

tab: CounterMap

import java.util.HashMap;
import java.util.Map;

class CounterMap<K> extends HashMap<K, Integer> {

  public Integer count (K key) {
    return count(key, 1);
  }

  public Integer count (K key, Integer increment) {
    Integer newValue = getOrDefault(key, 0) + increment;
    put(key, newValue);
    return newValue;
  }

  public Integer getCount (K key) {
    return get(key);
  }

  @Override
    public Integer remove (Object key) {
    return super.remove(key);
  }

  public Integer setCount (K key, Integer value) {
    return put(key, value);
  }
}
1 Like

Thanks for the reply!
That class of yours does look like a powerful solution, however IntDict is a class I’d still prefer to use at it fits my need exactly.
In any case, its sortValuesReverse() method is magnitudes faster than what I used before for this purpose - which was my implementation of selection sort on a regular array done with constant juggling of Word objects in that array… :smiley:

1 Like

Great – glad that IntDict is working for you!

Yes, sorting is exactly the reason I ended up moving towards TreeMap. I probably would have stuck with IntDict myself except I did that work for keystroke toggling and counting, so I wanted to be able to store, count, and sort either chars or a mix of chars and ints.