It’s never good when your boss appears in your office unexpectedly to tell you that the deadline you thought was a few days out is actually tomorrow. It’s also not good when it happens right after your analyst informs you that the system you thought was producing valid output was actually built on an oversimplification that was only just discovered. It’s especially bad when the model you’re working against is supposed to be crawling a payroll system with insufficient metadata to support the business rules. This particular model is very complex. So complex that there are professionals who dedicate their entire careers just to understanding this single facet of their industry.
Welcome to my hell, circa yesterday morning. The problem is that the process I wrote to handle all this in the first place was itself written under a relatively aggressive deadline. This is my preface for telling you that I wrote a crappy console app to “get-er done!” The sheer volume of data, coupled with the awful schema we inherited and the complex business rules and model, made for a very slow load of the better part of the database into memory. Only then could my code handle all the calculations and build the new structures, which were then saved back to a newer (still fairly complex) schema in the database. This wondrous and unnatural process took anywhere from 1.5 to 2 hours to complete. Still, as of last Friday, we thought we were in great shape…
The real issue with a long running process like this is that when a problem is identified, you have to identify the root cause, adapt the model/logic, test, then complete a full run. When there’s a 2 hour overhead in that process, it gets really, really painful. Now I wasn’t just on the hook for this one thing, so it’s not like I’d been able to give this my full attention. I ignorantly thought “this is good enough for now…”
I’m always saying what a good team we have here. Evan Hoff and Jim Cowart really helped me a lot. In one 18 hour day we managed to turn this slow crappy process into a fast crappy process (about 4 to 5 times faster). I also have to give credit to Oren Eini for making the wonderful NHibernate Profiler, a tool no dev wishing to remain sane should be without. Anyway, here’s what I learned:
The NHibernate.Linq Library Is Dangerous
You should only use it for fun time. The eager loading does not work correctly. In situations where you don’t care about lazy loading additional child collections, it’s worked just fine for me. I actually still use it for those cases because it’s type-safe and compile time checked for typos : )
You Can Die From Lazy Loading
Lazy loading ain’t free. It doesn’t seem like it would be a huge deal, but when you have a model that’s more than 2 levels deep with more than just one or two nodes off each aggregate root, lazy loading will kill you dead. Every collection you touch for the first time costs another round trip to the database.
Use The Future Query API To Eager Load
This is awesome. Fortunately Evan had just read Oren’s latest blog entry on this. With some HQL experimentation we figured out how incredibly powerful this is. Sadly, HQL is just a flipping string so it’s easy to mess up. The NH error messages were good enough to point me in the right direction. Read Oren’s post here and the HQL chapter here.
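To make the idea concrete, here’s a rough sketch of the pattern as I understand it from Oren’s post: queue one HQL query per child collection with Future, then let NH send them together when the first result is enumerated. The Employee, PayPeriod and Deduction classes are made up for illustration (they are not our real model), their mappings are assumed to exist, and the exact return type of Future differs between NH versions, so treat this as a sketch rather than gospel.

```csharp
using System.Collections.Generic;
using System.Linq;
using NHibernate;

// Hypothetical entities for illustration only; mappings are assumed to exist.
public class Employee
{
    public virtual int Id { get; set; }
    public virtual IList<PayPeriod> PayPeriods { get; set; }
    public virtual IList<Deduction> Deductions { get; set; }
}

public class PayPeriod { public virtual int Id { get; set; } }
public class Deduction { public virtual int Id { get; set; } }

public static class EmployeeLoader
{
    public static Employee LoadWithChildren(ISession session, int employeeId)
    {
        // One query per child collection, so no single query has to
        // return the cross product of PayPeriods and Deductions.
        var employees = session
            .CreateQuery("from Employee e left join fetch e.PayPeriods where e.Id = :id")
            .SetParameter("id", employeeId)
            .Future<Employee>();

        // The result is not used directly; the query only populates
        // the Deductions collection on the same Employee in this session.
        session
            .CreateQuery("from Employee e left join fetch e.Deductions where e.Id = :id")
            .SetParameter("id", employeeId)
            .Future<Employee>();

        // Enumerating the first future executes every queued query,
        // in one round trip when the database driver supports batching.
        return employees.ToList().FirstOrDefault();
    }
}
```

Both queries hydrate the same Employee instance inside the session, so the object that comes back has both collections loaded without any lazy-load hits afterwards.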
Second Level Caching Is Not Your Friend For High Volume
This wasn’t what I expected but sure enough, turning off the second level cache made the writes back to the database go much, much faster. Calling flush on the session was taking seconds just for a few persists until we eliminated the second level cache.
Use The Reflection Optimizer For High Volume
There is some up-front penalty here but it did help performance. If you’re using Fluent NH like I am, it’s a simple .UseReflectionOptimizer() call during the fluent database configuration step.
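For reference, a minimal sketch of what that configuration step might look like. The SQL Server 2005 dialect, the "Payroll" connection string key and the SessionFactoryBuilder class are assumptions for the example, not our actual setup. I’ve also shown one way to switch off the second level cache in the same place, using the standard cache.use_second_level_cache property.

```csharp
using FluentNHibernate.Cfg;
using FluentNHibernate.Cfg.Db;
using NHibernate;

public class SessionFactoryBuilder
{
    public static ISessionFactory Build()
    {
        return Fluently.Configure()
            .Database(MsSqlConfiguration.MsSql2005
                // "Payroll" is a hypothetical connection string key from app.config.
                .ConnectionString(c => c.FromConnectionStringWithKey("Payroll"))
                // Generates property accessors up front instead of reflecting on each call.
                .UseReflectionOptimizer())
            // Assumes the ClassMap<T> mappings live in the same assembly as this class.
            .Mappings(m => m.FluentMappings.AddFromAssemblyOf<SessionFactoryBuilder>())
            // Turns off the second level cache for this session factory.
            .ExposeConfiguration(cfg => cfg.SetProperty(
                NHibernate.Cfg.Environment.UseSecondLevelCache, "false"))
            .BuildSessionFactory();
    }
}
```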
You Need One Session Per Thread And Objects Cannot Be Shared Across Sessions
To get this monstrosity running faster we needed to make all this processing happen concurrently. Unfortunately, this process was very complex in how the new object model was created. Certain objects needed to be created and shared across models on an as needed basis. Before parallelizing it, I was able to store these shared objects in a hash and wrap access to them in a nice little function call that abstracted away the fact that I was creating them if they didn’t exist and retrieving them if they did.
This does not work when you’re spinning up threads with a session per thread (this is required) because as soon as you try to associate the shared instance across more than one session, NH breaks. Here’s how we got around this limitation:
Implement a double checked lock pattern so that you have a dictionary of locks per shared object id and a lock that protects access to that dictionary. When the consumer asks for a specific shared object by id, you check the database first. If the object wasn’t there, you lock on the outer dictionary lock and check to see if a lock exists for that shared object id. If it doesn’t, you create a lock and store it by the requested object id. After that, you lock on the lock object for that id and check the database again. If there is still no record, you create the object, save it and exit the lock. If it was in the database, you simply return it. Here’s some demo code to reinforce that messy explanation:
    private readonly object _dictionaryLock = new object();
    private readonly Dictionary<int, object> _sharedObjectLock = new Dictionary<int, object>();

    public bool GetSharedInstanceFromDB(ISession session, int id, out SharedObject instance)
    {
        instance = session.Linq<SharedObject>().FirstOrDefault(x => x.Id == id);
        return instance != null;
    }

    public SharedObject GetSharedInstance(ISession session, int id)
    {
        SharedObject instance = null;
        // First check: no lock taken when the object already exists.
        if (!GetSharedInstanceFromDB(session, id, out instance))
        {
            object idLock;
            // Find or create the lock for this id. Dictionary is not thread-safe,
            // so the lookup has to happen inside the outer lock as well.
            lock (_dictionaryLock)
            {
                if (!_sharedObjectLock.TryGetValue(id, out idLock))
                {
                    idLock = new object();
                    _sharedObjectLock.Add(id, idLock);
                }
            }
            lock (idLock)
            {
                // Second check: another thread may have saved it while we waited.
                if (!GetSharedInstanceFromDB(session, id, out instance))
                {
                    // code to create instance
                    session.Save(instance);
                    session.Flush();
                }
            }
        }
        return instance;
    }
Far from simple, but for us, unfortunately, it was necessary. The nice thing about this is that it gives you a way to multi-thread session access and still share a common object between threads without causing session collisions.
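If you’re on a newer framework than we were, the dictionary-of-locks bookkeeping can probably be handed off to ConcurrentDictionary. To be clear, this is a swapped-in variation and not what we shipped. The sketch below is self-contained: an in-memory dictionary stands in for the database table, and SharedObjectRegistry and Demo are names I made up for the example. With real NH sessions, each thread would still load its own copy through its own session rather than sharing the instance.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the shared entity.
public class SharedObject
{
    public int Id { get; set; }
}

public class SharedObjectRegistry
{
    // One lock object per shared object id. GetOrAdd is thread-safe,
    // so no outer lock is needed to protect this dictionary.
    private readonly ConcurrentDictionary<int, object> _locks =
        new ConcurrentDictionary<int, object>();

    // In-memory stand-in for the database table (assumption for the demo).
    private readonly ConcurrentDictionary<int, SharedObject> _table =
        new ConcurrentDictionary<int, SharedObject>();

    private int _created;
    public int CreatedCount { get { return _created; } }

    public SharedObject GetSharedInstance(int id)
    {
        SharedObject instance;
        // First check: no lock on the common path.
        if (_table.TryGetValue(id, out instance)) return instance;

        lock (_locks.GetOrAdd(id, _ => new object()))
        {
            // Second check: another thread may have created it while we waited.
            if (_table.TryGetValue(id, out instance)) return instance;

            instance = new SharedObject { Id = id };
            _table[id] = instance; // stands in for session.Save + session.Flush
            Interlocked.Increment(ref _created);
            return instance;
        }
    }
}

public static class Demo
{
    // Fires 32 concurrent requests spread over 4 distinct ids and
    // returns how many objects were actually created.
    public static int Run()
    {
        var registry = new SharedObjectRegistry();
        Parallel.For(0, 32, i => registry.GetSharedInstance(i % 4));
        return registry.CreatedCount;
    }
}
```

Each id is created exactly once no matter how many threads ask for it at the same time, which is the whole point of the second check inside the lock.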
DO NOT USE IDENTITY COLUMNS! AHHHHHHHHH
We used identity columns : \ I’ve pretty much always been against them because I don’t like the idea of my database telling me what the identifiers for my records are. I like to have control (does that make me crazy?). NH pros will tell you to use Hi-Lo or something like that, which allows your clients to create unique, yet arbitrary ids for your tables. Why does it matter?
Well, unlike my now dead ORM, NHibernate does not attempt to write your FK values from parent objects in one go. Instead it will do a follow-up Update to all the child rows to provide the database-specified parent Id when you’re using identity columns. This can get very expensive and chatty, very, very quickly. On the other hand, if you’re specifying the id in your client code, it’s already available to the child FK rows. IGNORE THIS ADVICE AT YOUR OWN FLIPPING PERIL. Sadly, we can’t just change all the schema and models at the last minute, but it’s definitely something I will take with me moving forward.
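For the curious, here is roughly what switching an entity to Hi-Lo looks like in a Fluent NH mapping. Paycheck is a hypothetical entity and the max-lo value of 1000 is an arbitrary pick for the example. As far as I know, NH keeps the next "hi" value in a small table of its own (hibernate_unique_key unless you tell it otherwise), so that table has to exist in your schema.

```csharp
using FluentNHibernate.Mapping;

// Hypothetical entity for illustration only.
public class Paycheck
{
    public virtual int Id { get; set; }
    public virtual string Description { get; set; }
}

public class PaycheckMap : ClassMap<Paycheck>
{
    public PaycheckMap()
    {
        // Hi-Lo: NH reserves a block of ids from the database once, then
        // assigns ids in memory, so the id is known before the insert is sent.
        Id(x => x.Id).GeneratedBy.HiLo("1000");
        Map(x => x.Description);
    }
}
```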
And that’s all I have to say about that. Hope it’s helpful : )