Tuesday, November 27, 2007

Monkey tales: Continuous Integration at Terracotta

As a platform to cluster and scale Java applications, Terracotta is put to widely varied and resourceful uses. Furthermore, Terracotta supports Linux, Windows, and Solaris (plus OS X for development), and compiles and runs with Sun JDKs (1.4, 1.5, 1.6) as well as IBM's. As for application servers, we support Tomcat, JBoss, Jetty, WebSphere, WebLogic, etc. The list goes on and on. That puts a lot of burden on our testing and build infrastructure, and we can't think of a better way to carry it than continuous integration.

CruiseControl (CC) comes in handy and works really well for us. It periodically checks out new changesets from the repo, then compiles and runs all tests. Our custom build system, tcbuild, is a JRuby + Ant combo. It can run defined groups of tests such as system tests, unit tests, container tests, crash tests, etc.

We create a CC project for each group of tests, namely check_unit, check_system, check_container, etc., varying in JDK, appserver, and test mode. With all the permutations, that comes to 57 projects.

Each CC project will:
- check out latest revisions
- compile
- run tests in that project
- report
....

And we do this on multiple boxes running Red Hat, SUSE 10, Windows 2003, and Solaris 9 and 10. We call these monkey machines.

So what happens when there's a bad build (a compile error), or tests fail? CC will email a specific team depending on which group of tests failed. There's also a mailing list that keeps track of every failure.

This works great, but there's a drawback. Whenever there's a compile error, we get spammed, and badly: all 57 CC projects fail on every monkey box.

To solve this problem, we devised a two-tier monkey scheme. We have a monkey box, called the "monkey police", that checks out the latest revision(s) every 5 minutes, then compiles and runs a group of crucial system tests. If the build fails to compile or any of the tests fail, the build is badly broken. In that case, the police emails the last committer(s) and the engineering mailing list about the error. If everything works out fine, the police marks the latest revision as good and saves it to a shared file. The monkey troops (the other test machines) read from that shared file and only "svn update" up to the latest known good revision before they run their tests.

So before, the troops would do "svn update" and pull down every change. With this scheme, the troops do "svn update -r 2333", where "2333" is a good revision that the monkey police has tested.

And we have one monkey police per branch. So when there's a build error or a fatal test failure, we know about it right away, and it doesn't disturb our test machines in the meantime.

Why bother with a bad checkin? :)
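The troop side of the scheme is tiny. Here's a minimal sketch of it in Java, assuming a hypothetical shared file /shared/good_revision.txt written by the monkey police:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStreamReader;

public class TroopUpdate {
    public static void main(String[] args) throws Exception {
        // Read the last revision the monkey police marked as good
        BufferedReader rev = new BufferedReader(new FileReader("/shared/good_revision.txt"));
        String goodRevision = rev.readLine().trim();
        rev.close();

        // "svn update -r <rev>" instead of a plain "svn update"
        Process p = Runtime.getRuntime().exec(
            new String[] { "svn", "update", "-r", goodRevision });
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        for (String line; (line = out.readLine()) != null; ) {
            System.out.println(line);
        }
        p.waitFor();
    }
}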

This has worked great for us and I thought I'd share the experience.

Monday, October 15, 2007

Solution to classpath too long (aka input line too long) problem in Windows

If you use Java on Windows, you're bound to run into the classpath-too-long problem as your classpath grows. Windows limits a single command line to roughly 1-2KB of characters, and the infamous "Input line is too long" error is very annoying. Here's a trick to get around it.

"java.exe" command also scan for classes from the environment variable "CLASSPATH". If you can break your classpath into separate folders and jars, you can concatenate them like this:


setlocal

set CLASSPATH=c:/my/first/jar/somejar.jar;%CLASSPATH%
set CLASSPATH=c:/my/second/jar/someotherjar.jar;%CLASSPATH%
set CLASSPATH=c:/path/to/a/folder;%CLASSPATH%
.......
.......

java com.mycompany.Main

endlocal

By using the CLASSPATH environment variable, you don't need to pass the classpath to the "java" command. The setlocal/endlocal pair ensures the CLASSPATH is "local" to this process and won't pollute the system-wide value.
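A quick way to double-check what the JVM actually picked up: the CLASSPATH variable surfaces as the java.class.path system property, so a one-liner can dump it.

public class ShowClasspath {
    public static void main(String[] args) {
        // Reflects whatever was set via -cp or the CLASSPATH environment variable
        System.out.println(System.getProperty("java.class.path"));
    }
}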

--------
Updated:

Travis has pointed out that there is a limit on environment variables in Windows too; I found it to be 8K. So this isn't an absolute solution, but it should sustain you for a while. As suggested by D, you can shorten the paths by using a virtual drive. Furthermore, JDK 6 supports wildcards (*) in the classpath.

Virtual drive:
Say you have a bunch of jars under c:/path/to/lib:

subst z: c:/path/to/lib
set CLASSPATH=z:/jar1.jar;%CLASSPATH%

Or with JDK6:

set CLASSPATH=c:/path/to/lib/*;%CLASSPATH%

Monday, October 08, 2007

Check for broken links in Ruby, Bash script and Java

I found myself writing 3 different versions of a function to check for broken links, in Ruby, Bash, and Java. I just want to document them here in case anyone is interested. Note that these functions use the HEAD method (as opposed to GET/POST), so they won't download a big file just to see if a link is live.

Ruby
require 'net/http'
require 'uri'

def isLive?(url)
  uri = URI.parse(url)
  response = nil
  Net::HTTP.start(uri.host, uri.port) { |http|
    response = http.head(uri.path.size > 0 ? uri.path : "/")
  }
  return response.code == "200"
end

puts isLive?("http://google.com")
puts isLive?("http://asdfasdf.com")

Bash
#!/bin/bash

function isLive {
  wget -q --spider "$1"
}

isLive "http://google.com/somefakelink"

if [ $? -eq 0 ]; then
  echo "Good link"
else
  echo "Broken link"
fi

Java: Edited (December 18, 2008): I've written a better Java version that handles link forwarding and doesn't use a 3rd-party API here

/* needs httpunit-1.6.jar from http://httpunit.sourceforge.net */

private boolean isLive(String link) {
  try {
    WebRequest request = new HeadMethodWebRequest(link);
    WebConversation wc = new WebConversation();
    WebResponse response = wc.getResource(request);
    return response.getResponseCode() == 200;
  } catch (Exception e) {
    return false;
  }
}
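For reference, the follow-up version mentioned above drops the 3rd-party API in favor of the plain JDK; it comes out roughly like this sketch (not the exact code from that post):

import java.net.HttpURLConnection;
import java.net.URL;

private static boolean isLive(String link) {
  try {
    HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
    conn.setRequestMethod("HEAD");          // check liveness without downloading the body
    conn.setInstanceFollowRedirects(true);  // count a redirected link as live
    return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
  } catch (Exception e) {
    return false;
  }
}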

Tuesday, September 18, 2007

Toggle code portion in your blog

I'm just testing out my JavaScript to toggle code sections in my blog. It's a useful feature that helps keep your blog looking short while still providing enough detail.

Show code


Hope this works :)

Monday, September 17, 2007

Distributed EhCache as second level cache under Hibernate

EhCache is one of the great options for a Hibernate second-level cache. By making it distributed, multiple web applications can share the same cache, enhancing your overall performance and availability. To enable the distributed cache, Terracotta 2.4.3 has built-in support for EhCache 1.3.0 and 1.2.4. I will go through an example of how this is done.

The stack could be visualized like this:

-----------        -----------
 Tomcat 1           Tomcat 2
-----------        -----------
 Hibernate          Hibernate
------------------------------
           EhCache
          TERRACOTTA
------------------------------


Terracotta is the driving force, though its presence is transparent to your web app thanks to bytecode instrumentation. First, let's take a look at enabling EhCache in the Hibernate configuration file:
[+]hibernate.cfg.xml

These are the same standard settings you would use in the non-distributed case: turn on the query cache, point Hibernate to ehcache.xml, and finally specify a cache provider. The ehcache.xml can be as simple as this:
[+]ehcache.xml

One thing I'd like to point out is that Terracotta persists heap memory to disk efficiently (and faults it back in as needed), so "overflowToDisk" becomes redundant in StandardQueryCache; as a matter of fact, Terracotta doesn't honor this option. Now that the cache is set up, we need to let Hibernate know, in our entity mapping files, which entities we'd like cached at runtime. In my example, I have an Event table (title, date) and a Person table (firstname, lastname) with a many-to-many relationship through a join table, PERSON_EVENT. With that in mind, let's examine Event.hbm.xml and Person.hbm.xml:
[+]Mapping files

The details might be distracting but if you're familiar with Hibernate, this should be as simple as it gets :)

Now that we're past all the settings, the fun begins with our servlets. I've created two: CreateEvents, which populates data into our tables, and QueryEvents, which queries them and displays the cache hit statistics.
[+]CreateEvents.java

Some default data (3 persons and 2 events) is created during the init() phase. At (2), I added an option to insert an additional Person into the database, which we'll use later to demonstrate cache invalidation. To collect cache hit and miss statistics, a query statistics object is created at (3); it gives us the hit/miss counts at (4) for the query "select * from Event".
In QueryEvents.java, we ask Hibernate for the lists of persons and events, which shows us whether the cache or the database is being used:
[+]QueryEvents.java
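The hit/miss numbers in the outputs below come from Hibernate's statistics API. A sketch of the relevant calls, assuming Hibernate 3's API and the literal query string used by the servlet:

import org.hibernate.SessionFactory;
import org.hibernate.stat.QueryStatistics;
import org.hibernate.stat.Statistics;

// somewhere after building the SessionFactory
Statistics stats = sessionFactory.getStatistics();
stats.setStatisticsEnabled(true);

// per-query cache counters for our events query
QueryStatistics queryStats = stats.getQueryStatistics("select * from Event");
long misses = queryStats.getCacheMissCount();
long hits = queryStats.getCacheHitCount();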

I ran these two servlets in one Tomcat to make sure everything was working correctly. Hitting http://localhost:8080/Events/create (mapped to the CreateEvents servlet) gives:

Events created: [Event 1: 2007-09-30, Event 2: 2007-12-01]
Event query cache miss: 1
Event query cache hit: 0

Since it's the first time we query for events, we get one cache miss: Hibernate went to the database for the data.
Now we hit http://localhost:8080/Events/query (mapped to the QueryEvents servlet), and the result is:

Events found: [Event 1: 2007-09-30, Event 2: 2007-12-01]
Event query cache miss: 1
Event query cache hit: 1

People found: [Ichigo Kurosaki, Abarai Renji, Ishida Uryu]
Person query cache miss: 1
Person query cache hit: 0

Participants of Event 1: [Abarai Renji, Ishida Uryu, Ichigo Kurosaki]
Participants of Event 2: [Abarai Renji, Ishida Uryu]

As expected, the Event query now hits the cache, raising the hit count to 1. And since it's the first time we query for Person, the Person cache miss count is 1. The participants list proves that the event POJOs coming from the cache are valid and can be re-associated with this Hibernate session. Now, if we run the CreateEvents servlet on Tomcat 1 and QueryEvents on Tomcat 2, the distributed cache should give us the same result.
This is where Terracotta comes in. No settings need to change in any of the Hibernate mapping files, nor in ehcache.xml. There is no code change either. What you need to do is run Tomcat with Terracotta enabled, which involves setting 3 Java system properties on the Tomcat JVM:

-Xbootclasspath/p:"path/to/Terracotta/bootjar"
-Dtc.install-root=/path/to/Terracotta/install
-Dtc.config=/path/to/tc-config.xml

Detailed instructions can be found here.
Luckily, Terracotta has a nice session configuration tool that helps you set up a cluster of 2 Tomcats (or WebLogic). All you need to do is import your WAR file. I created an Events.war file containing both of my servlets and all the needed jars. I configured tc-config.xml to let Terracotta know I'm using Hibernate and EhCache, by adding those modules (1). Also, classes that will be shared need to be instrumented (2). Terracotta supports session sharing as well, if you declare your webapp's name (3); however, I'm not clustering any sessions in this example.
[+]tc-config.xml

The session configurator will start up the Terracotta server and both Tomcats. We can now hit the CreateEvents servlet on the first Tomcat at http://localhost:9081/Events/create and QueryEvents on the second Tomcat at http://localhost:9082/Events/query.

The result I got is:
 Events/create:
Events created: [Event 1: 2007-09-30, Event 2: 2007-12-01]
Event query cache miss: 1
Event query cache hit: 0

Events/query:
Events found: [Event 1: 2007-09-30, Event 2: 2007-12-01]
Event query cache miss: 0
Event query cache hit: 1

People found: [Ichigo Kurosaki, Abarai Renji, Ishida Uryu]
Person query cache miss: 1
Person query cache hit: 0

Participants of Event 1: [Abarai Renji, Ishida Uryu, Ichigo Kurosaki]
Participants of Event 2: [Abarai Renji, Ishida Uryu]

If I reload Events/query, the statistics are as expected:

Event query cache miss: 0
Event query cache hit: 2

Person query cache miss: 1
Person query cache hit: 1


Now to test cache invalidation. After hitting Events/query, the cache holds a list of 3 persons. If we create one new person by hitting http://localhost:9081/Events/create?fn=John&ln=Smith, what we have in the cache becomes stale data. Of course, thanks to Terracotta, the second Tomcat + Hibernate is aware of this situation, and the stale data is invalidated. That leads to a cache miss (instead of a hit) when we reload http://localhost:9082/Events/query:

Event query cache miss: 0
Event query cache hit: 3

People found: [Ichigo Kurosaki, Abarai Renji, Ishida Uryu, John Smith]
Person query cache miss: 2
Person query cache hit: 1

As you can see, the Event query cache hit count continues to rise, while we now have a cache miss on the Person query since the cached data was invalidated; Hibernate had to hit the database for fresh data.

I hope I didn't bore you with too much detail, but I think it's important to see each step. Terracotta is greatly beneficial if you choose to use EhCache as a distributed cache with Hibernate.

You can download the project here and give it a try.

Thursday, September 06, 2007

Simple directory browser for Amazon S3

If you have used Amazon S3, you might be annoyed by the lack of support for directory browsing. If you hit your repo URL, all you get is an XML file listing your content. So I wrote a small piece of JavaScript to read that XML file and display it as a directory tree, just as if you were browsing an FTP site.

For example, Terracotta has an S3 repo at http://download.terracotta.org. If you click on that link, you'll get an XML file. Now say I want to see what's under http://download.terracotta.org/maven2: you'll get nothing but an error trying to go there.

My javascript is under the page http://download.terracotta.org/maven2/index.html

Just right-click on that page and read the source; the JavaScript is pretty simple. It could be made better or fancier, but I'm no JavaScript guru :)
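If you'd rather script it than browse it, the same listing is easy to walk from Java too. A sketch that prints every key under a prefix (the bucket URL and prefix are just examples; a public bucket listing is a plain ListBucketResult XML document):

import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class S3List {
    public static void main(String[] args) throws Exception {
        // S3's GET-bucket API takes an optional ?prefix= filter
        URL url = new URL("http://download.terracotta.org/?prefix=maven2/");
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(url.openStream());

        // Each object in the bucket shows up as a <Key> element
        NodeList keys = doc.getElementsByTagName("Key");
        for (int i = 0; i < keys.getLength(); i++) {
            System.out.println(keys.item(i).getTextContent());
        }
    }
}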

Maybe it could be of use for you.

Note: S3 doesn't serve an index.html file automatically (after all, it's not a web server), so you have to hit the index.html explicitly.

Sunday, August 26, 2007

Lightweight Google Geocoder with Java

I often look for tools before deciding to build them myself, for obvious reasons: saving time and labor. I was looking for a Java implementation of the Google geocoder, but there seems to be only one, GeoGoogle. My first impression was that the tool is a little heavy for its job. Furthermore, it chooses to deal with the XML format, and all the parsing and schema validation is a bit much for my taste. To use the Google geocoding service, all you need to do is send an HTTP request along with your address. The response comes in either XML or JSON format and includes details about the address you sent, such as city, state, zipcode, etc., but the most important parts are the longitude and latitude. So I decided to go with JSON and wrote this tool. Hopefully someone else finds it useful. Many thanks to the folks who wrote JSON-LIB.


/* requires json-lib (http://json-lib.sourceforge.net) and commons-io */
import java.io.ByteArrayOutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

import net.sf.json.JSONException;
import net.sf.json.JSONObject;

import org.apache.commons.io.IOUtils;

public class GGcoder {
private static final String URL = "http://maps.google.com/maps/geo?output=json";
private static final String DEFAULT_KEY = "YOUR_GOOGLE_API_KEY";

public static GAddress geocode(String address, String key) throws Exception {
URL url = new URL(URL + "&q=" + URLEncoder.encode(address, "UTF-8")
+ "&key=" + key);
URLConnection conn = url.openConnection();
ByteArrayOutputStream output = new ByteArrayOutputStream(1024);
IOUtils.copy(conn.getInputStream(), output);
output.close();

GAddress gaddr = new GAddress();
JSONObject json = JSONObject.fromString(output.toString());
JSONObject placemark = (JSONObject) query(json, "Placemark[0]");

final String commonId = "AddressDetails.Country.AdministrativeArea";

gaddr.setFullAddress(query(placemark, "address").toString());
gaddr.setZipCode(query(placemark,
commonId + ".SubAdministrativeArea.Locality.PostalCode.PostalCodeNumber")
.toString());
gaddr.setAddress(query(placemark,
commonId + ".SubAdministrativeArea.Locality.Thoroughfare.ThoroughfareName")
.toString());
gaddr.setCity(query(placemark,
commonId + ".SubAdministrativeArea.SubAdministrativeAreaName").toString());
gaddr.setState(query(placemark, commonId + ".AdministrativeAreaName").toString());
gaddr.setLat(Double.parseDouble(query(placemark, "Point.coordinates[1]")
.toString()));
gaddr.setLng(Double.parseDouble(query(placemark, "Point.coordinates[0]")
.toString()));
return gaddr;
}

public static GAddress geocode(String address) throws Exception {
return geocode(address, DEFAULT_KEY);
}

/* allow query for json nested objects, ie. Placemark[0].address */
private static Object query(JSONObject jo, String query) {
try {
String[] keys = query.split("\\.");
Object r = queryHelper(jo, keys[0]);
for (int i = 1; i < keys.length; i++) {
r = queryHelper(JSONObject.fromObject(r), keys[i]);
}
return r;
} catch (JSONException e) {
return "";
}
}

/* help in query array objects: Placemark[0] */
private static Object queryHelper(JSONObject jo, String query) {
int openIndex = query.indexOf('[');
int endIndex = query.indexOf(']');
if (openIndex > 0) {
String key = query.substring(0, openIndex);
int index = Integer.parseInt(query.substring(openIndex + 1, endIndex));
return jo.getJSONArray(key).get(index);
}
return jo.get(query);
}

public static void main(String[] args) throws Exception {
System.out.println(GGcoder.geocode("650 Townsend st, San Francsico, CA"));
System.out.println(GGcoder.geocode("94103"));
}
}


/* Class to hold geocode result */
public class GAddress {
public String address;
public String fullAddress;
public String zipCode;
public String city;
public String state;
public double lat;
public double lng;

/* getters and setters */
}

Thursday, August 16, 2007

Make use of Java dynamic proxy

I didn't know about Java dynamic proxies before, but I finally got a chance to learn about them and found a good use. The test I was writing had the smell of repeated code:

private void testPutWithoutSynch(params...) throws Exception {
...

map.put("k1", "v1");

assertions...
}

private void testPutWithSynch(params...) throws Exception {
...

synchronized(map) {
map.put("k1", "v1");
}

assertions...
}


The pattern repeated for other Map operations like putAll, remove, etc. do_set_up() performs some setup steps according to the parameter list. As you can see, I have to write the test method twice for each Map operation I want to test: with and without the synchronized block.

So there was this code smell, and I didn't know how to get rid of it. Then Tim Eck, my coworker, showed me how to use a Java proxy. The idea is to wrap my map in a proxy and use reflection to invoke the map's methods. With an invocation handler, I can have a variable that controls whether or not a particular operation should synchronize.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Map;

class Handler implements InvocationHandler {
private final Map map;
private boolean useSynch = false;

public Handler(Map map) {
this.map = map;
}

public void setUseSynch(boolean flag) {
this.useSynch = flag;
}

public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
try {
if (useSynch) {
return invokeWithSynch(method, args);
} else {
return method.invoke(map, args);
}
} catch (InvocationTargetException e) {
throw e.getTargetException();
}
}

private Object invokeWithSynch(Method method, Object[] args) throws Throwable {
synchronized (map) {
return method.invoke(map, args);
}
}
}


Then the proxy can be created like this:

Map map = new HashMap();
Handler handler = new Handler(map);
Map mapProxy = (Map)Proxy.newProxyInstance(getClass().getClassLoader(), new Class[] { Map.class }, handler);


With that my test methods can be reduced to:

private void testPut(params..., boolean useSynch) throws Exception {
do_set_up(params)

handler.setUseSynch(useSynch);
mapProxy.put("k1", "v1");

assertions depend on useSynch
}


Then I just need to call this method twice and accomplish the same thing.

testPut(params..., false);
testPut(params..., true);


Pretty neat, huh? The Java proxy is a nice tool to know.

Thursday, July 26, 2007

Cluster a standalone Spring app to calculate Mandelbrot set

I'd heard about Spring for a while but had made little attempt to check out what it was. Most descriptions of Spring gave me the impression that only big enterprise applications would use it. Man, I was wrong :) As I dug into the Pro Spring book and read the Spring reference, I was able to convert one of my standalone apps to Spring and reap the benefits Spring has to offer. Not only that, I was able to cluster my bean and share it among JVMs with little effort, thanks to Terracotta.

My application is a simple demo that calculates the Mandelbrot set. It is multithreaded, using a task queue and workers. Here is a rough sketch (I'm UML illiterate):

---------------------------          ---------------------------
 MandelbrotModel             1 --->*  CalculateNode : Runnable
   rawData: int[][]                     model: MandelbrotModel
   workLoad: BlockingQueue
---------------------------          ---------------------------


Each CalculateNode has a reference to the model, and thus to the workLoad queue. It takes a task (a Segment) out of workLoad, processes it, and puts its findings back into "rawData". Spring comes into the picture by letting me wire the dependency between the nodes and the model, like so:

<bean id="model" class="mandelbrot.MandelbrotModel" scope="singleton">
<property name="length">
<value>600</value>
</property>
<property name="numNodes">
<value>2</value>
</property>
<property name="numTasks">
<value>20</value>
</property>
</bean>

<bean id="node" class="mandelbrot.CalculateNode" scope="singleton">
<property name="model" ref="model">
</property>
</bean>

The above snippet describes two beans, a "model" and a "node". Properties match the fields of your class one to one: the field "private int length" in the model is set to 600 pixels, and so on. The Spring "magic" is in the "node" bean: it says to point the node's "model" field at the model bean above, hence the dependency injection.

public class CalculateNode implements Runnable {
private MandelbrotModel model;

public void setModel(MandelbrotModel model) {
this.model = model;
}
}

Notice I have a setModel() method in CalculateNode, but I never call it explicitly in my program; Spring calls it and sets the field for me. My node can do all its work taking the model reference for granted. The main method is really simple:

public static void main(String[] args) {
ApplicationContext ctx = new ClassPathXmlApplicationContext(
"mandelbrot.xml");

CalculateNode node1 = (CalculateNode) ctx.getBean("node");
new Thread(node1).start();

CalculateNode node2 = (CalculateNode) ctx.getBean("node");
new Thread(node2).start();
}

I just need to fetch two nodes and start them in two threads, with no need to worry about where my model is or how the nodes got a reference to it.
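For the curious, the node's run() method isn't shown in this post; it's essentially a take/compute/report loop over the shared queue. A rough sketch, where getWorkLoad() and calculate() are my own stand-ins:

public void run() {
  try {
    while (true) {
      // block until a segment of rows is available on the shared queue
      Segment segment = (Segment) model.getWorkLoad().take();
      // compute this slice of the Mandelbrot set (placeholder for the real math)
      int[][] data = calculate(segment);
      // report the result back; the model publishes a SegmentEvent from here
      model.addSegmentData(data, segment);
    }
  } catch (InterruptedException e) {
    // interrupted: time to shut down this worker
  }
}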

But wait, there's more. Spring also publishes application events for you, for cheap. Now that I have the Mandelbrot data, I can display it in a Swing application. I don't want to wait for all the nodes to finish calculating and paint the graphic at the end; I want to paint in real time: as soon as a node finishes a segment, I'll paint it. To pull that off, I need to know when a segment is finished, and the right source to ask is the model; it learns that the moment a node reports back with data. The "ghetto" way would be to implement a listener list, traverse it, and invoke a callback for each listener. That was my old implementation. It does the job, but it ain't neat, and in the real world you can see why it's sometimes undesirable: it couples your components. Spring solves this problem nicely: if you want to publish events, take the application context and call publishEvent() with your message! That's all you have to do.

Let's examine this function in the MandelbrotModel; it's invoked by a node to report its results:

public synchronized void addSegmentData(int[][] data, Segment segment) {
for (int row = segment.getStart(); row < segment.getEnd(); row++) {
System.arraycopy(data[row - segment.getStart()], 0, rawData[row], 0,
length);
}

ctx.publishEvent(new SegmentEvent(this, segment));
}

"ctx" is a reference to the context of the application, automatically given to anyone if you implement a Spring interface "ApplicationContextAware".

@Override
public void setApplicationContext(ApplicationContext applicationcontext)
throws BeansException {
ctx = applicationcontext;
}

As for anyone who wants to listen to events: implement the ApplicationListener interface and catch the events you care about:

@Override
public void onApplicationEvent(ApplicationEvent event) {
if (event instanceof SegmentEvent) {
final SegmentEvent segEvent = (SegmentEvent) event;
processSegment(segEvent.getSegment());
}
}
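SegmentEvent itself isn't listed in this post; it only needs to carry the segment and extend Spring's ApplicationEvent, roughly like this sketch:

import org.springframework.context.ApplicationEvent;

public class SegmentEvent extends ApplicationEvent {
  private final Segment segment;

  public SegmentEvent(Object source, Segment segment) {
    super(source); // ApplicationEvent requires the publishing source
    this.segment = segment;
  }

  public Segment getSegment() {
    return segment;
  }
}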

I have my ViewerFrame (which extends JFrame) act as the listener; it's declared in my XML bean definition file and also holds a reference to the model:

<bean id="frame" class="mandelbrot.ViewerFrame" scope="singleton">
<property name="model" ref="model">
</property>
</bean>

There is no change in the main method. As Spring initializes the "frame" bean, the window automatically shows up. This is easy because I made the frame self-aware: it knows when it has been created by implementing the InitializingBean interface:

@Override
public void afterPropertiesSet() throws Exception {
int length = model.getLength();
this.image = new BufferedImage(length, length, BufferedImage.TYPE_INT_RGB);

this.setSize(length, length);
SwingUtilities.invokeLater(new Runnable() {
public void run() {
setVisible(true);
}
});
}

The fun doesn't end here though. Now I want another JVM to join in the calculation, making it go faster (a little drama here, since the Mandelbrot set is pretty fast to calculate). The ideal scenario is that the second JVM also has the same "model" in memory, so its minions of "nodes" can just pound on the work. This second JVM can even be on a totally separate machine. How the heck is anyone going to pull this off without changing code or worrying about networking, sharing heap, etc.? This is where the power of Terracotta comes in. I just mark the "model" bean as shared and the SegmentEvent as distributed. That lets any JVM in the cluster hold a reference to this model, and events published from the model become cluster-wide.

Terracotta is fully Spring aware, and the snippet below is how I wired it to my Spring app.

<spring>
<jee-application name="*">
<application-contexts>
<application-context>
<paths>
<path>mandelbrot.xml</path>
</paths>
<beans>
<bean name="model">
</bean>
</beans>
<distributed-events>
<distributed-event>mandelbrot.SegmentEvent</distributed-event>
</distributed-events>
</application-context>
</application-contexts>
</jee-application>
</spring>

With the Terracotta Eclipse plugin, after starting a Terracotta server (used to hold the shared objects), starting two instances of my application that share the "model" is as easy as hitting the Run button twice. Now I have 2 JVMs, each with 2 nodes working on the same model. Neato. As my bird would say when something excites him: "oh wow".

So there you have it. I'm still a Spring novice, but I'm sure there are many cool things in Spring waiting to be discovered. And with the added boost from Terracotta, the fun is multiplied.




Here is the link to the source code

Wednesday, June 27, 2007

Alternative to properties file: YAML beans

Plain ol' properties files are great for configuring your apps. But if your configuration requires structured data and you prefer not to use XML, there is a great alternative: YAML and beans.

Let's assume we have the config below: one client and multiple servers.


client:
  log: /tmp/client/log
  debug: on

servers:
  - host: zeus
    port: 9510

  - host: apollo
    port: 9511


YAML treats the whole config file as a hash map whose keys are "client" and "servers". In turn, "client" is another hash map, and "servers" is an array of maps. Sounds complicated, but it's straightforward once you get the hang of it.

To parse this config, JYaml makes it easier than fixing cereal:


Map config = (Map) Yaml.load(new FileReader("config.yml"));

Map client = (Map) config.get("client");
System.out.println("Client: " + client);

List servers = (ArrayList) config.get("servers");
System.out.println("Servers: " + servers);

And the output is:

Client: {debug=true, log=/tmp/client/log}
Servers: [{host=zeus, port=9510}, {host=apollo, port=9511}]


You would access client's and server's properties like this:

client.get("log");
Map firstServer = (Map)servers.get(0);
firstServer.get("host");


If you want the config mapped to Java beans, JYaml can populate the beans for you. This is my preferred way, since it's much nicer.

The beans for your config will be as simple as:

/* Top level bean for the whole config */
public class TcConfig {
private List servers;
private Client client;

public List getServers() {
return servers;
}
public void setServers(List servers) {
this.servers = servers;
}
public Client getClient() {
return client;
}
public void setClient(Client client) {
this.client = client;
}
}

/* Client */
public class Client {
private String log;
private boolean debug;

/* getters and setters */

public String toString() {
return "log=" + log + ", debug=" + debug;
}
}

/* Server */
public class Server {
private String host;
private int port;
private String log;

/* getters and setters */

public String toString() {
return "host=" + host + ", port=" + port + ", log=" + log;
}
}


And to populate your beans, just 1 line of code:

TcConfig config = (TcConfig)Yaml.loadType(new File("config.yml"), TcConfig.class);

System.out.println("Servers : " + config.getServers());
System.out.println("Client : " + config.getClient());


XMLBeans can accomplish the same thing, with other features like autogenerated Java beans, schema enforcement, etc., but for simple things I prefer YAML.

Thursday, June 14, 2007

Share that POJO - Hibernate clustered and empowered

I haven't had much experience with OR mapping in Java apps, so I wanted to study the Hibernate tutorial and try to cluster it at the same time. Starting with Terracotta 2.4 (currently at stable0), Hibernate is supported.

My first question to the engineer who worked on the feature was "What is being shared in Hibernate?" It turns out the plain old Java objects (POJOs) that Hibernate constructs from the database are the ones we're interested in sharing, in this case across multiple JVMs. I told myself that's pretty handy: we don't have to hit the database again for the same information on another node when those objects have already been loaded and shared thanks to the Terracotta DSO server. He told me that's not the only cool feature, though. With those shared objects from Hibernate, one JVM can simply reassociate them with its own Hibernate session and start to access collections mapped as one-to-many or many-to-many that don't yet exist in memory due to lazy initialization. Wow. (Of course, that Hibernate session will have to hit the database for the new info you request. You don't always get free beer!)

With my interest piqued, I worked through this cool and thorough tutorial from Hibernate.org.

The schema for this tutorial is as follows:

EVENTS           PERSON_EVENT      PERSON
----------       -------------     ----------
*EVENT_ID        *EVENT_ID         *PERSON_ID
EVENT_DATE       *PERSON_ID        FIRSTNAME
TITLE                              LASTNAME

PERSON_EMAIL_ADDR
------------------
*PERSON_ID
*EMAIL_ADDR


For each table there is a JavaBean to represent it, and each bean instance maps to a row in the database. Well, that's Hibernate in a nutshell for me (speaking from my own ignorance and limited experience).
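The beans themselves are plain Java; for instance, Event would look roughly like this sketch (field names and types inferred from the mapping below, not copied from the tutorial):

import java.util.Date;
import java.util.HashSet;
import java.util.Set;

public class Event {
    private Long id;
    private Date date;
    private String title;
    private Set participants = new HashSet(); // many-to-many with Person

    public Long getId() { return id; }
    private void setId(Long id) { this.id = id; } // Hibernate assigns it

    public Date getDate() { return date; }
    public void setDate(Date date) { this.date = date; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public Set getParticipants() { return participants; }
    public void setParticipants(Set participants) { this.participants = participants; }
}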

The XML mapping for the EVENTS table is as follows:

<hibernate-mapping>

<class name="events.Event" table="EVENTS">
<id name="id" column="EVENT_ID">
<generator class="native"/>
</id>
<property name="date" type="timestamp" column="EVENT_DATE"/>
<property name="title"/>

<set name="participants" table="PERSON_EVENT" lazy="true" inverse="true" cascade="lock">
<key column="EVENT_ID"/>
<many-to-many column="PERSON_ID" class="events.Person"/>
</set>
</class>

</hibernate-mapping>


It's a many-to-many relationship between Event and Person. Notice that I set lazy loading to true; that's needed to demonstrate the point I mentioned earlier.

Here is Person's mapping file:


<class name="events.Person" table="PERSON">
<id name="id" column="PERSON_ID">
<generator class="native"/>
</id>
<property name="age"/>
<property name="firstname"/>
<property name="lastname"/>

<set name="events" table="PERSON_EVENT">
<key column="PERSON_ID"/>
<many-to-many column="EVENT_ID" class="events.Event"/>
</set>

<set name="emailAddresses" table="PERSON_EMAIL_ADDR">
<key column="PERSON_ID"/>
<element type="string" column="EMAIL_ADDR"/>
</set>

</class>


For the hibernate.cfg.xml file, I changed the tutorial's config a little to use the Derby database, running in server mode. Here is the change:


<property name='connection.driver_class'>org.apache.derby.jdbc.ClientDriver</property>
<property name='connection.url'>jdbc:derby://localhost:1527/MyDbTest</property>
<property name='connection.username'>user1</property>
<property name='connection.password'>user1</property>


Derby ships with JDK 1.6.0_01, under java-home/db/lib. To start it:


% java -jar derbyrun.jar server start

Then I created "MyDbTest" database by running this once:

% java -jar derbyrun.jar dblook -d "jdbc:derby://localhost:1527/MyDbTest;create=true"


Now on to the tutorial's EventManager.java. I didn't follow it fully: I created 2 events and 3 persons. The first event is an "Engineer meeting", the second a "Document Meeting", with these participants:

// engMeeting = {steve, orion, tim}
// docMeeting = {steve, orion}

Here is a snippet of EventManager.java.


public class EventManager {

// shared object - declared as a root in tc-config.xml
private static final List events = new ArrayList();

public static void main(String[] args) {
EventManager mgr = new EventManager();

// create 3 persons Steve, Orion, Tim
Long steveId = mgr.createAndStorePerson("Steve", "Harris");
mgr.addEmailToPerson(steveId, "steve@terracottatech.com");

...

// create 2 events
Long engMeetingId = mgr.createAndStoreEvent("Eng Meeting", new Date());
mgr.addPersonToEvent(steveId, engMeetingId);
mgr.addPersonToEvent(orionId, engMeetingId);
mgr.addPersonToEvent(timId, engMeetingId);

Long docMeetingId = mgr.createAndStoreEvent("Doc Meeting", new Date());

....

// store our events into a shared object
synchronized (events) {
events.addAll(mgr.listEvents());
}

HibernateUtil.getSessionFactory().close();
}

private List listEvents() {

Session session = HibernateUtil.getSessionFactory().getCurrentSession();
session.beginTransaction();

List result = session.createQuery("from Event").list();

session.getTransaction().commit();
return result;
}


Most of it is straightforward, copied from the Hibernate tutorial. Terracotta comes into the picture with the shared object "events": it is declared as a root in the Terracotta config, and it holds the list of events that Hibernate loaded from the database. Here is how to declare it in tc-config.xml:

<root>
<field-name>events.EventManager.events</field-name>
<root-name>tcEvents</root-name>
</root>

The synchronized block on the shared object "events" is necessary: we're in a multithreaded, multi-JVM context.

So that's enough for us to start the first JVM. I turned on SQL logging in log4j so we can see the SQL generated by Hibernate. A lot of SQL statements are printed, both inserts and selects, as you would expect. After it finishes, our "events" list contains the event objects, and since the mapping is set with "lazy=true", their participant sets are empty. We can confirm this by looking at the Terracotta Admin console:



Under roots we have one root, namely "tcEvents", holding 2 objects. Expanding them, we see that the participant set is empty; Hibernate's internal set is null:

org.hibernate.collection.PersistentSet.set=null

So that's my first app. Here comes my second app, which runs in a totally separate VM. I called it EventChecker; it lists the emails of the people in the first meeting.

First, it has to know about the same shared object "tcEvents" that we put onto the Terracotta DSO server. We can do this easily by specifying it in tc-config.xml:


<roots>
<root>
<field-name>events.EventManager.events</field-name>
<root-name>tcEvents</root-name>
</root>
<root>
<field-name>events.EventChecker.events</field-name>
<root-name>tcEvents</root-name>
</root>
</roots>


By using the same root-name, it's mapped to the same list as EventManager's. The code is pretty short, so I list all of it here:


public class EventChecker {
// shared object
private static final List events = new ArrayList();

public static void main(String[] args) {

synchronized (events) {
// list events before even opening a Hibernate session
System.out.println("** events: " + events);

Session session = HibernateUtil.getSessionFactory().getCurrentSession();
session.beginTransaction();

// reassociate transient pojos to this session
for (Iterator it = events.iterator(); it.hasNext(); ) {
session.lock((Event)it.next(), LockMode.NONE);
}

// list people in first event
Event event = (Event)events.get(0);
Set people = event.getParticipants();
System.out.println("** people: " + people);

// list emails of people from first event
Set emails = new HashSet();
for (Iterator it = people.iterator(); it.hasNext(); ) {
Person person = (Person)it.next();
emails.addAll(person.getEmailAddresses());
}
System.out.println("** emails: " + emails);

session.getTransaction().commit();
HibernateUtil.getSessionFactory().close();
}
}
}


Here you see the use of a synchronized block on the shared object; we're accessing it in a brand new JVM. Without Terracotta, the events list would be empty. With Terracotta, it contains the events from EventManager, faulted in transparently by the DSO server.
I then reassociated the POJOs with my newly created Hibernate session. This is always needed when you work with Hibernate; I didn't know about it until I got an exception saying I didn't have a session, or that the session might have been closed by Hibernate.

After that, I can list the participants and their emails using Hibernate. This is all Hibernate's doing. I love it :)

Here is the output. It shows that Hibernate didn't generate SQL to load the events, since they were already loaded and stored by Terracotta; it only selected from PERSON and PERSON_EMAIL_ADDR:

** events: [Eng Meeting: 2007-06-14 11:06:12.0, Doc Meeting: 2007-06-14 11:06:12.0]

Hibernate: select participan0_.EVENT_ID as EVENT1_1_, participan0_.PERSON_ID as PERSON2_1_, person1_.PERSON_ID as PERSON1_2_0_, person1_.age as age2_0_, person1_.firstname as firstname2_0_, person1_.lastname as lastname2_0_ from PERSON_EVENT participan0_ left outer join PERSON person1_ on participan0_.PERSON_ID=person1_.PERSON_ID where participan0_.EVENT_ID=?

** people: [Orion Letizi, Steve Harris, Tim Eck]

Hibernate: select emailaddre0_.PERSON_ID as PERSON1_0_, emailaddre0_.EMAIL_ADDR as EMAIL2_0_ from PERSON_EMAIL_ADDR emailaddre0_ where emailaddre0_.PERSON_ID=?
Hibernate: select emailaddre0_.PERSON_ID as PERSON1_0_, emailaddre0_.EMAIL_ADDR as EMAIL2_0_ from PERSON_EMAIL_ADDR emailaddre0_ where emailaddre0_.PERSON_ID=?
Hibernate: select emailaddre0_.PERSON_ID as PERSON1_0_, emailaddre0_.EMAIL_ADDR as EMAIL2_0_ from PERSON_EMAIL_ADDR emailaddre0_ where emailaddre0_.PERSON_ID=?

** emails: [teck@terracottatech.com, steve@terracottatech.com, orion@terracottatech.com]

Looking at the Admin console again, we see that the participant set is now loaded in by Hibernate and shared by Terracotta.



There is one gotcha I ran into that I should mention. In hibernate.cfg.xml, there is an option to wipe and re-create the schema every time Hibernate starts:

<!-- Drop and re-create the database schema on startup -->
<property name=\"hbm2ddl.auto\">create</property>


When you run EventManager, you need it to create the schema for you. But when you run EventChecker, you don't want Hibernate to wipe out your database, so comment out that line before running it.

I had fun working through the Hibernate tutorial and playing with Terracotta. I don't have much experience with J2EE and Hibernate, so I can't really comment on practical use cases. Maybe you will have a better idea of how to use it :)

Friday, April 13, 2007

Performance comparison between ConcurrentHashMap and synchronized HashMap in Terracotta

Since the introduction of ConcurrentHashMap in Java 5, it has been the better choice over HashMap for highly threaded applications. To see how much better, Brian Goetz, in his book "Java Concurrency in Practice", wrote a test to compare the performance of the two maps. The test scenario is this:

N threads concurrently execute a loop that chooses a random key and looks up the value corresponding to that key. If the value is found, it is removed with probability 0.02. If not, it is added to the map with probability 0.6.

The result: ConcurrentHashMap performs much better as the number of threads increases. Below is the graph taken from the book, with the author's permission.
Since Open Terracotta supports ConcurrentHashMap, I wanted to see whether the performance advantage still holds (and by how much) in a distributed environment. My test scenario is modeled after Brian's; however, the threads are spread over 4 nodes (Linux RH4). Here is my test code:


package tc.qa;

import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentHashMapLoadTest extends Thread {
private static double WRITE_PROBABILITY = 0.6;
private static double REMOVE_PROBABILITY = 0.02;
private static int THREAD_COUNT = 4;
private static int VM_COUNT = 4;
private static int RUNTIME = 5 * 60 * 1000;
private static int KEY_RANGE = 100000;

// roots - in TC world, these are static finals and accessible across JVMs
private Map map = new ConcurrentHashMap();
private AtomicInteger throughput = new AtomicInteger(0);
private CyclicBarrier barrier;

private Random random;
private int op;

public ConcurrentHashMapLoadTest() {
random = new Random();
barrier = new CyclicBarrier(THREAD_COUNT);
}

public void run() {

// ready
if (barrier() == 0) {
System.out.println("Started...");
}

// go
long start = System.currentTimeMillis();
while (System.currentTimeMillis() - start < RUNTIME) {
Integer key = new Integer(random.nextInt(KEY_RANGE));
if (get(map, key) != null) {
if (random.nextDouble() < REMOVE_PROBABILITY) {
remove(map, key);
op++;
}
} else {
if (random.nextDouble() < WRITE_PROBABILITY) {
put(map, key, key);
op++;
}
}
op++;
}

throughput.addAndGet(op);

if (barrier() == 0) {
System.out.println("Map type: "
+ map.getClass().getSimpleName());
System.out.println("Runtime: " + RUNTIME);
System.out.println("Number of threads: " + THREAD_COUNT);
System.out.println("Write probability: " + WRITE_PROBABILITY);
System.out.println("Remove probability: " + REMOVE_PROBABILITY);
System.out.println("Ops per second: "
+ (throughput.intValue() * 1000.0 / RUNTIME));
}
}

private int barrier() {
try {
return barrier.await();
} catch (Exception e) {
e.printStackTrace();
}
return -1;
}

private Object get(Map map, Object key) {
if (map instanceof ConcurrentHashMap) {
return map.get(key);
} else {
synchronized (map) {
return map.get(key);
}
}
}

private void put(Map map, Object key, Object value) {
if (map instanceof ConcurrentHashMap) {
map.put(key, value);
} else {
synchronized (map) {
map.put(key, value);
}
}
}

private void remove(Map map, Object key) {
if (map instanceof ConcurrentHashMap) {
map.remove(key);
} else {
synchronized (map) {
map.remove(key);
}
}
}

public static void getParams() {
WRITE_PROBABILITY = Double.parseDouble(System.getProperty("wp", "0.6"));
REMOVE_PROBABILITY = Double.parseDouble(System.getProperty("rmp", "0.02"));
THREAD_COUNT = Integer.parseInt(System.getProperty("thread", "4"));
VM_COUNT = Integer.parseInt(System.getProperty("vm", "4"));
RUNTIME = Integer.parseInt(System.getProperty("runtime", "300000"));
}

public static void main(String[] args) {
getParams();
int threads_per_vm = THREAD_COUNT / VM_COUNT;
for (int t = 0; t < threads_per_vm; t++) {
ConcurrentHashMapLoadTest thread = new ConcurrentHashMapLoadTest();
thread.start();
}
}

}


I wrapped the put() and get() methods to easily switch between map types. After running this test on 4 nodes with JDK 1.6.0_01, I got the results below:

The Y axis is throughput, normalized to the throughput of 4 threads with ConcurrentHashMap.
Even with network overhead and lock contention in Terracotta, ConcurrentHashMap still far outperforms HashMap. Notice how HashMap's performance doesn't degrade much as the number of threads increases. This is all thanks to the way Terracotta replicates only the delta changes in the map to other nodes; no serialization is involved.

The gap between ConcurrentHashMap and HashMap in Terracotta isn't as big as it was in the single-JVM case of Brian's test. I may be comparing apples with oranges here, because obviously it's not the same test, just the same idea. We're always striving for better overall performance, and each release has proven that.

Check us out and let us know what you think. http://www.terracotta.org

Hung-

Sunday, February 11, 2007

Using POSTGIS to find points of interest within a radius

I've recently looked into PostGIS, an extension that enables support for spatial objects in the PostgreSQL database.

The problem I was trying to solve: given a point (longitude, latitude), how can I find all other points within a certain radius? I've solved this before with a formula that calculates the distance between 2 points, but that's pretty slow if your table has a lot of records. PostGIS can solve the problem faster and more elegantly.

First things first: let's get some data to play around with. I found an excellent tutorial that teaches you how to get zipcode data from the US Census. I'll skim through it here, but you should definitely check out that article.

Assume you have a zipcode table with these columns:

state zip latitude longitude
-------------------------------------------
CA 92230 33.911404 -116.768347
CA 92234 33.807761 -116.464731
CA 92236 33.679872 -116.176562

Now you need to add a column to this table to hold the geometry represented by the [longitude, latitude] pair:

SELECT AddGeometryColumn( 'public', 'zipcode', 'geom', 32661, 'POINT', 2 );

* adds a column 'geom' to the table 'zipcode' under the schema 'public'
* the column holds a point with 2 dimensions and spatial reference id (SRID) 32661

Next we populate the table, converting the data from SRID 4269 (longitude/latitude) to SRID 32661 (WGS 84 system):

UPDATE zipcode
SET geom = transform(PointFromText('POINT(' || longitude || ' ' || latitude || ')',4269),32661) ;

The transformation from one spatial reference to another is needed for calculating surface distance. WGS 84 is measured in meters, by the way.

Now your table will look like this:

state zip latitude longitude geom
-----------------------------------------------
CA 92230 33.911404 -116.768347 0101000020957F0000B7FB4ED5F2C44EC166AFB20D1A3D5341
CA 92234 33.807761 -116.464731 0101000020957F0000AA8102ECECFD4EC18284302439245341
CA 92236 33.679872 -116.176562 0101000020957F000067A4B44D2D3B4FC1596E974B370E5341

Now we only need to write a query. Say: find all the zipcodes within a 10-mile (16,093-meter) radius of "92230".


SELECT state, zip
FROM zipcode
WHERE
distance(
   transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661),
   geom) < 16093

Here the function distance(here_geom, there_geom) returns meters. Notice how you read a point (long/lat) from a string using PointFromText() and then transform it.

However, this query isn't optimal: it calculates the distance from zip 92230 to every other zipcode. We would rather put a 10-mile bounding box around our point of interest, filter for the points that intersect this box, and only then calculate exact distances to the center. Sounds like a lot of work, but in reality it can be achieved by adding another condition:


SELECT state, zip
FROM zipcode
WHERE
geom && expand(transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661), 16093)
AND
distance(
   transform(PointFromText('POINT(-116.768347 33.911404)', 4269),32661),
   geom) < 16093


The new query is much, much faster: my benchmark shows a 200% speedup over the first query. Pretty sweet!
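If you're issuing this from Java, the query drops straight into JDBC. A sketch, assuming a hypothetical connection URL and the zipcode table above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RadiusQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/gisdb", "user", "password");

        // Bounding-box filter first, exact distance check second
        String sql =
            "SELECT state, zip FROM zipcode " +
            "WHERE geom && expand(transform(PointFromText(?, 4269), 32661), ?) " +
            "AND distance(transform(PointFromText(?, 4269), 32661), geom) < ?";

        String point = "POINT(-116.768347 33.911404)";
        double meters = 16093; // ~10 miles
        PreparedStatement ps = conn.prepareStatement(sql);
        ps.setString(1, point);
        ps.setDouble(2, meters);
        ps.setString(3, point);
        ps.setDouble(4, meters);

        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("state") + " " + rs.getString("zip"));
        }
        rs.close();
        ps.close();
        conn.close();
    }
}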