Sunday, October 08, 2006

Share precious heap memory across multiple JVM's

I've written some puzzle solvers for fun during college. Among them was for Boggle board. One annoying thing about it is the startup time. It took a while to load up a 4 MB English dictionary into memory. Some of my other puzzle solvers also use the same dictionary. Wouldn't it be cool if you only need to load this dictionary once and all of the apps could use it? It would stay resident in the heap and ready to be used in different applications and between runs.

One other solution is to use a database but this is not always preferable, mostly due to speed, and of course it is not as elegant. I wanted to use the dictionary object as a native Java object. The idea was shelved for awhile until I came to work with Terracotta, whose distributed shared objects (DSO) might just pull off the resident-heap-memory idea nicely.

First, let's take look at the puzzle solver. Given a 4 by 4 board, the main objective is to find as many words as possible that can be constructed from the letters of sequentially adjacent cubes. The board I’m solving in the game has the content as follows:
c  a  t  d
l i n e
m a r o
p e t s

Some obvious words are: calm, line, mine, and dent.

The program will use the dictionary to scan for all possibilities. A recursive algorithm works great in solving this puzzle. For every possible path, it will look up the current word in dictionary. So let's see how we get the dictionary loaded into memory.

In the Terracotta world, that would mean loading the data into a root (shared object). That shared object will be transparently clustered and made available to any other JVM in the cluster.

The dictionary is implemented as a Trie for fast retrieving. The dictionary loader will read words from a file, one word per line and insert into the trie’s structure. For efficiency, we'll batch the inserts. Here I choose the batch size to be 50. Adjusting this number will affect the load time.


package dictionary;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.Vector;


public class DictionaryLoader {

private Dictionary dictionary = new Dictionary();

public void load(File dictFile) {
int batchSize = 50;
List<String> batchOfWords = new Vector<String>();

BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(dictFile));
String word;
while ((word = reader.readLine()) != null) {
batchOfWords.add(word);

if (batchOfWords.size() == batchSize) {
synchronized (dictionary) {
dictionary.add(batchOfWords);
batchOfWords.clear();
}
}
}

/* add the rest of the words left in the batch */
if (batchOfWords.size() > 0) {
synchronized (dictionary) {
dictionary.add(batchOfWords);
}
}

System.out.println("size = " + dictionary.size());

} catch (Exception e) {
e.printStackTrace();
}

finally {
try {
if (reader != null) reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

public static void main(String[] args) {
DictionaryLoader loader = new DictionaryLoader();
long elapsed = System.currentTimeMillis();
loader.load(new File("dict.txt"));
elapsed = System.currentTimeMillis() - elapsed;
System.out.println("Time took to load the dictionary (ms): "
+ elapsed);
}
}



In the Terracotta configuration file, tc-config.xml, we declare the dictionary field to be a shared "root." The entire graph of objects referenceable by this root automatically becomes clustered by Terracotta. We also give this root field a cluster-wide name ("dictionary") with the root-name tag. This is very important because this is how other applications get access to this same root.


<root>
<field-name>dictionary.DictionaryLoader.dictionary</field-name>
<root-name>dictionary</root-name>
</root>



To test our little dictionary, I wrote the DictionaryLookup app. It also declares a dictionary object, but names it differently.


package dictionary;

public class DictionaryLookup {

private static final Dictionary dict = new Dictionary();


public void checkWord(String word) {
if (dict.isWord(word)) {
System.out.println("'" + word + "' is a word" );
}
else if (dict.isPrefix(word)) {
System.out.println("'" + word + "' is a prefix");
}
else {
System.out.println("'" + word + "' is NOT found");
}
}


public static void main(String[] args) {

DictionaryLookup dl = new DictionaryLookup();
String[] words = {"eat", "my", "shorts", "homer", "crapola", "dict",
"config", "configuration", "sweet", "abracadabra"};

for (int i = 0; i < words.length; i++) {
dl.checkWord(words[i]);
}
}
}


The field "dict" is declared as root in tc-config.xml but that doesn’t mean the server will have 2 roots (namely two clustered versions of dictionary objects). The trick is the root-name tag. It tells the server to treat these two roots as one, interchangeably.

<root>
<field-name>dictionary.DictionaryLoader.dictionary</field-name>
<root-name>dictionary</root-name>
</root>
<root>
<field-name>dictionary.DictionaryLookup.dict</field-name>
<root-name>dictionary</root-name>
</root>


The interesting thing about the DictionaryLookup is how it assumes the dictionary content is already there and available for use. All it does is to declare a Dictionary field and Terracotta automatically assigns the same “dictionary” shared root to that field.

DictionaryLookup has proven that this approach works correctly and as I wanted. I can go ahead and attach the dictionary to our puzzle solver:

<root>
<field-name>dictionary.DictionaryLoader.dictionary</field-name>
<root-name>dictionary</root-name>
</root>
<root>
<field-name>dictionary.DictionaryLookup.dict</field-name>
<root-name>dictionary</root-name>
</root>
<root>
<field-name>games.boggle.Board.theDictionary</field-name>
<root-name>dictionary</root-name>
</root>



If you look at Board.java, you'll see how the game can focus on its main logic without having to worry about loading up the dictionary in memory. This eliminates huge amount of time if you choose to use a really big size dictionary. And not only just one application can use dictionary at a time; multiple applications from multiple networks can take advantage of this network attached memory.

Source code available here.