Jun 17 2010

Adding LINQ and Lucene.Net Support To CouchDB with Relax

Category: Open Source | Symbiote | NoSQL
AlexRobson @ 09:40

As I mentioned in my last post about ETL fun with CouchDB and Relax, I’m trying to get Relax to an RC state. A big part of that is trying to wrap up a few concerns that our development group at work has around using CouchDB as the application store. One of the biggest concerns we’ve had surrounds programmatic querying.

Trade-offs

As .Net developers, we’re “spoiled” with a really nice RDBMS in SQL Server. It’s great being able to get data based on any criteria my heart desires. But the cost of its flexible and powerful query language is a pre-defined schema. Mapping the domain model to a relational schema is a non-trivial and continuous challenge.

In the NoSQL world, we get schema-less storage which makes it possible to store anything that can be expressed in the document format. CouchDB (and many others) uses JSON (JavaScript Object Notation), which is simple, clean, expressive and even allows storage of hierarchical data.

The trade-off specific to document stores such as CouchDB is that writing queries against the document store “all willy-nilly” becomes expensive and somewhat unpredictable regarding results. While CouchDB provides the concept of views using map/reduce against the documents, a view is not something that can be defined on the fly, and each view introduces its own performance and storage consequences.

Lucene (.Net)!

I’ve been hearing about Lucene for years. I never really read much about it and I’ve never worked on a project where it was in use. I recently learned about another CouchDB .Net API that provided LINQ-ish support to CouchDB via the couchdb-lucene project on GitHub. I was initially heartbroken to learn that Lucene is a Java project and that couchdb-lucene required installing the JVM. Ew. But that heartbreak was quickly remedied when I learned that, like many great Java libraries, there’s an equally good .Net port.

At this point, I realized that Lucene wasn’t something I could afford to remain ignorant of. Lucene is basically an indexing and search engine. It’s really good at what it does. I was delighted with how quickly I went from not knowing anything about Lucene to being able to actually code against the API.

While couchdb-lucene is very cool in terms of what it can do, it wasn’t enough for me. First off, I wanted to build the indexing engine on my own terms. Second, I wanted a lot more control over how I could interact with that engine. Third, I want no part of the JVM and I know that our CTO and admins don’t want it in their production environment. If I wanted to use Lucene with Relax, I was going to have to build what I wanted.

I Did It My Way

I started with a self contained service that would handle both the indexing responsibility as well as the query processing specific to Relax. The more I worked at it, the more I realized that I should actually separate the Lucene.Net interfaces and services. The nice thing about this is that I’ve created a general purpose Symbiote Lucene project that would allow anyone to add Lucene capabilities to their applications/services very easily. While I will talk more about that package in another post, the gist of Symbiote’s Lucene library is that it provides an observer pattern for feeding it documents to index and has a simple query service for retrieving a list of scores and documents for a Lucene Query.

The LINQ Cherry On Top

Everyone loves to hear “It has LINQ support”. I’ve provided a way to turn a lambda predicate into Lucene query syntax. I wrap this in API calls but I also expose the class so that you can take any predicate and turn it into the Lucene Query.

This is not a LINQ provider model. Instead, Relax lets you express your query as a predicate passed to a call:

var matchingDevelopers = repository.GetAllByCriteria<Developer>(x =>
 (x.Name.StartsWith("Al") || x.Title.Contains("Developer")) && x.Age >= 30 && x.Projects.Any(p => p.Users.Count > 3));

So let me explain a bit of what you’re looking at. This snippet assumes that Developer is a class type that we’re storing in CouchDB, that it has a Projects collection, and that each Project in turn has a Users collection.

Currently Relax has support for

  • order of clause evaluation in the predicate using parentheses
  • string extension methods StartsWith, EndsWith and Contains
  • standard comparison operators ==, !=, <=, and >=
  • the Any extension method, which allows for expressions against child collection characteristics as long as they are stored with the document.
  • expressions against the Count property of collections because Relax’s Lucene indexer is adding that as an indexed field of the document.
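
To give a feel for what turning a predicate into Lucene query syntax involves, here is a minimal sketch. It is not Relax’s translator: the LuceneQuerySketch class, its method names and the trimmed-down Developer type are made up for illustration. It only handles literal values, &&, ||, ==, >=, <= and the three string methods (no Any or Count), and it assumes the query parser accepts leading wildcards and open-ended ranges and that numbers are indexed in a way that makes a range meaningful.

```csharp
using System;
using System.Globalization;
using System.Linq.Expressions;

// Hypothetical document type, used only for this illustration.
public class Developer
{
    public string Name { get; set; }
    public string Title { get; set; }
    public int Age { get; set; }
}

// Illustrative only: walks a predicate's expression tree and emits Lucene-style syntax.
public static class LuceneQuerySketch
{
    public static string Translate<T>(Expression<Func<T, bool>> predicate)
    {
        return Visit(predicate.Body);
    }

    private static string Visit(Expression e)
    {
        switch (e.NodeType)
        {
            case ExpressionType.AndAlso:
                return Join((BinaryExpression) e, "AND");
            case ExpressionType.OrElse:
                return Join((BinaryExpression) e, "OR");
            case ExpressionType.Equal:
                return Compare((BinaryExpression) e, "{0}:{1}");
            case ExpressionType.GreaterThanOrEqual:
                return Compare((BinaryExpression) e, "{0}:[{1} TO *]");
            case ExpressionType.LessThanOrEqual:
                return Compare((BinaryExpression) e, "{0}:[* TO {1}]");
            case ExpressionType.Call:
                return Call((MethodCallExpression) e);
            default:
                throw new NotSupportedException(e.NodeType.ToString());
        }
    }

    // Parentheses preserve the grouping the predicate was written with.
    private static string Join(BinaryExpression b, string op)
    {
        return "(" + Visit(b.Left) + " " + op + " " + Visit(b.Right) + ")";
    }

    private static string Compare(BinaryExpression b, string format)
    {
        return string.Format(format, Field(b.Left), Literal(b.Right));
    }

    private static string Call(MethodCallExpression m)
    {
        // Static calls (such as the Any extension method) are out of scope here.
        if (m.Object == null) throw new NotSupportedException(m.Method.Name);

        var field = Field(m.Object);
        var value = Literal(m.Arguments[0]);
        switch (m.Method.Name)
        {
            case "StartsWith": return field + ":" + value + "*";
            case "EndsWith": return field + ":*" + value;
            case "Contains": return field + ":*" + value + "*";
            default: throw new NotSupportedException(m.Method.Name);
        }
    }

    // Only direct property access (x.Name) is handled.
    private static string Field(Expression e)
    {
        return ((MemberExpression) e).Member.Name;
    }

    // Only literal constants are handled; captured variables are not.
    private static string Literal(Expression e)
    {
        return Convert.ToString(((ConstantExpression) e).Value, CultureInfo.InvariantCulture);
    }
}
```

Calling LuceneQuerySketch.Translate<Developer>(x => (x.Name.StartsWith("Al") || x.Title.Contains("Developer")) && x.Age >= 30) yields ((Name:Al* OR Title:*Developer*) AND Age:[30 TO *]). The parentheses in the output are what carry the order of clause evaluation over from the lambda.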

Not Quite Ready

While I feel the Symbiote Lucene project doesn’t need much additional work, I do have a good bit of work left on the service aspect that handles the actual indexing of CouchDB and hosting the query service. Eventually, I’d like to be able to provide direct integration of the query service to CouchDB so that you can issue Lucene queries against CouchDB via the RESTful API directly.

At this point in time, Relax has an index and query Windows service that handles indexing via Relax’s change stream API and hosts a RESTful query endpoint via Symbiote’s Restfully project. While these work fine for a proof of concept, I want more options regarding how all this is hosted and controlled. As I said above, it would be great to offer direct CouchDB integration similar to what couchdb-lucene does without the same amount of configuration overhead.

Lastly, I need to improve the test coverage on all of this since it’s lame-ish. Developers like Josh Bush have been a huge help in tracking down issues and proposing solutions. Thanks, Josh!

If You Just Want To Peek

The code is currently available via the lucene branch on github: http://github.com/arobson/Relax/tree/lucene. Please keep in mind that there are definitely breaking changes coming to the Relax.Lucene project and Relax itself. The good news is that it’s improving daily!


Jun 15 2010

ETL To CouchDB With Symbiote, Relax and Reactive Extensions

Category: Open Source | .Net Framework | Symbiote
AlexRobson @ 11:47

I’ve been working on Relax a lot lately. I’ve recently added a Lucene.Net Symbiote project which Relax then uses to provide document indexing and LINQ queries for CouchDB (more about that in another post). A very important part of getting Relax to an RC is understanding how it all behaves under high load.

But what’s high load? We generally target SMBs or internal applications which aren’t going to see social networking kinds of stress. Still, I like knowing the upper bounds of what I’m working on.

I think if you’re familiar with this type of problem and you’re familiar with Rx, just looking at the code samples is probably all you need to appreciate what this is doing. This was a learning experience for me, and I was so happy with what a drastic improvement Reactive Extensions allowed me to easily introduce, I wanted to share it.

Finding A Good Source Of Data

I chose the Stack Overflow dataset (you need to scroll down to find the ClearBits link). Though I often disagree philosophically with what Jeff Atwood and Joel Spolsky say, Stack Overflow is a good thing and I really admire the SO team for sharing their data.

I’m only using the posts file atm, which is > 2 million records and the file size is roughly 2.8 GB. I think that’s plenty of data for what I need : )

The Best Way To Bulk Load In Relax / CouchDB

CouchDB provides a bulk document API which allows us to store multiple documents at once in order to save on the overhead involved in the persistence call. Relax makes extensive use of this API behind the scenes. In this case, we want to be able to batch several thousand documents together to persist at once to minimize the overhead cost.

The other thing to note is that CouchDB handles concurrent load exceptionally well (at least from my experience) and so I want the save commands firing off asynchronously as soon as the batch is ready.

Let The Fun Begin

The SO data is all in XML. Yay. This would allow me to use an XML reader to stream through the file and create documents. I’m doing this through an IObservable implementation. I use a base abstract class that provides me with my standard IObservable code. It’s nothing magic, but here’s the source for the sake of clarity:

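using System;
using System.Collections.Concurrent;

// ForEach on the observer bag is not a BCL method; it is assumed here to be an
// IEnumerable<T> extension method available in the project.
public abstract class BaseObservable<TNotification> 
    : IObservable<TNotification>, IDisposable
{
    protected ConcurrentBag<IObserver<TNotification>> observers { get; set; }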

    public virtual void Notify(TNotification notification)
    {
        observers.ForEach(x => x.OnNext(notification));
    }

    public virtual void SendCompletion()
    {
        observers.ForEach(x => x.OnCompleted());
    }

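    // Note: the observable itself is returned as the subscription handle,
    // so disposing it removes every observer, not just this one.
    public virtual IDisposable Subscribe(IObserver<TNotification> observer)
    {
        var disposable = this as IDisposable;
        observers.Add(observer);
        return disposable;
    }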

    protected BaseObservable()
    {
        this.observers = new ConcurrentBag<IObserver<TNotification>>();
    }

    public void Dispose()
    {
        while (observers.Count > 0)
        {
            IObserver<TNotification> o;
            observers.TryTake(out o);
        }
    }
}

Now for the important part: the observable XmlReader:

public class PostReader
    : BaseObservable<XElement>
{
    protected string xmlExportPath { get; set; }

    public void Start()
    {
        using(var stream = new FileStream(
                    xmlExportPath, 
                    FileMode.Open, 
                    FileAccess.Read, 
                    FileShare.None, 
                    2048, 
                    true))
        {
            using(var reader = XmlReader.Create(stream))
            {
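                reader.MoveToContent();

                // XNode.ReadFrom leaves the reader on the node that follows the
                // element it consumed, so only call Read() when we did not just
                // read a row; otherwise adjacent rows could be skipped.
                while(!reader.EOF)
                {
                    if(reader.NodeType == XmlNodeType.Element && reader.Name == "row")
                    {
                        var element = XElement.ReadFrom(reader) as XElement;
                        Notify(element);
                    }
                    else
                    {
                        reader.Read();
                    }
                }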
            }
        }
        SendCompletion();
    }

    public PostReader(string xmlExportPath)
    {
        this.xmlExportPath = xmlExportPath;
    }
}

Basically, all I’m doing is reading in each row element (the row element represents a Post item), creating an XElement, notifying the observer(s), and sending the complete signal after the entire file has been read.

Why Bother With The Reactive Extensions?

Starting off, I didn’t know how many records I was looking at. I did know I didn’t want to deserialize everything into memory first and then save because that’s a waste of time, waste of RAM and wouldn’t be easy to parallelize. I also know from experience that my bottleneck is IO. Spinning up async tasks faster than the tasks can complete creates memory issues and, in this case, out of memory exceptions.

I’m not suggesting you can’t handle all this without Reactive Extensions. I am suggesting you won’t be able to do it as elegantly or as simply without them.

Enter Reactive Extensions

The Reactive Extensions (or Rx) is a library from Microsoft DevLabs, created by Erik Meijer and his team of ninja assassin developers. Rx and now RxJS are two projects you really ought to be learning about. And yes, that dizzy feeling you’ll get is normal; the human brain isn’t meant to take in so much distilled awesome.

Rx makes it easy to program against asynchronous event streams. Take a moment to think about that and let it sink in…

Making Friends With IObservable

Get comfortable with IObservable because it’s the core of Rx. I like to think of IObservable as a message pump. Erik Meijer likes to compare IObservable with IEnumerable: essentially, he sums it up as IEnumerable being a pull mechanism and IObservable being a push mechanism. Rx helps bridge the gap between functionality and tooling for pull mechanisms and push mechanisms and in some cases allows us to interchange the two.
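
The pull/push distinction is easier to see side by side. The following is a small sketch using only the IObserver interface from the base class library; CollectingObserver and PullVersusPush are names invented for this illustration and are not part of Rx or Symbiote.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical observer that simply records whatever is pushed to it.
public class CollectingObserver<T> : IObserver<T>
{
    public readonly List<T> Items = new List<T>();
    public bool Completed { get; private set; }

    public void OnNext(T value) { Items.Add(value); }
    public void OnError(Exception error) { throw error; }
    public void OnCompleted() { Completed = true; }
}

public static class PullVersusPush
{
    // Pull: the consumer drives, asking the sequence for one item at a time.
    public static List<int> Pull(IEnumerable<int> source)
    {
        var seen = new List<int>();
        foreach (var item in source) seen.Add(item);
        return seen;
    }

    // Push: the producer drives, calling the observer whenever it has something,
    // then signalling that nothing more is coming.
    public static void Push(IEnumerable<int> source, IObserver<int> observer)
    {
        foreach (var item in source) observer.OnNext(item);
        observer.OnCompleted();
    }
}
```

In the pull version the caller decides when the next item is read; in the push version the observer has no say and just reacts to OnNext and OnCompleted, which is the shape the PostReader above has.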

Enough Talk

I’m going to show you the rest of the code and then break it down. Each section has a header so if you’re not interested in that portion of the source, just skip ahead.

class Program
{
    static void Main(string[] args)
    {
        Assimilate
            .Core()
            .Daemon(x => x
                .Arguments(args)
                .Name("SOBulkLoader")
                .DisplayName("Stack Overflow Bulk Loading Service")
                .Description("Does what it says"))
            .Relax(x => x.UseDefaults().TimeOut(1000000))
            .AddConsoleLogger<LoadingService>(x => x.Info().MessageLayout(m => m.Date().Message().Newline()))
            .RunDaemon();
    }
}

public class LoadingService
    : IDaemon
{
    protected IDocumentRepository repository { get; set; }
    protected XmlSerializer postSerializer { get; set; }

    public void Start()
    {
        "Loading service starting"
            .ToInfo<LoadingService>();

        Action<IList<XElement>> saveAction = SaveChunk;
        var loader = new PostReader(@"e:\stackoverflow\062010 so\posts.xml");
        var batches = loader.BufferWithCount(5000);
        var results = batches.Select(x => saveAction.BeginInvoke(x, null, null));
        
        results
            .BufferWithCount(5)
            .Subscribe(x => x.ForEach(y => y.AsyncWaitHandle.WaitOne()));

        loader.Start();
    }

    protected void SaveChunk(IList<XElement> x)
    {
        var list = x.Select(ProcessPost).ToList();
        repository.SaveAll(list);
        "Posts {0} to {1} chunked and saved"
            .ToInfo<LoadingService>(list.First().Id, list.Last().Id);
    }

    public Post ProcessPost(XElement element)
    {
        var content = element.ToString();
        return postSerializer.Deserialize(new StringReader(content)) as Post;
    }

    public void Stop()
    {
        "Loading service stopping"
            .ToInfo<LoadingService>();
    }

    public LoadingService(IDocumentRepository repository)
    {
        this.repository = repository;
        this.postSerializer = new XmlSerializer(typeof (Post));
    }
}

Program Main (A Shameless Symbiote Plug)

If you haven’t seen it before, this is a Symbiote Assimilation call: a centralized, fluent API for configuring multiple open source frameworks. In this code, I’m initializing Symbiote with a call to .Core() (that’s always required). Next I define the service I’m creating using Daemon. Then I’m using the default configuration for Relax and changing the timeout to 1k seconds (more than I need). I’m also adding a Log4Net console logger and telling it how I want the message laid out. Lastly, I’m starting the Daemon. The great thing is that Symbiote is registering everything (including configuration and all the different project dependencies) with StructureMap, which has a lot of good implications.

What’s an IDaemon?

The Daemon project takes TopShelf and makes it super easy to create Windows services. IDaemon requires Start and Stop methods and Symbiote handles the rest (like dependency injection, etc.). While TopShelf is really for Windows Services, I use it for all my console applications because it adds some really nice things and it’s simple to use.

IDocumentRepository

This is the primary interface for storage and retrieval of documents in CouchDB. I’m taking a dependency on it which is supplied by Symbiote when it instantiates and runs the service.

Putting Rx To Work

The loader variable is an instance of the PostReader class and takes the path to the posts xml file. From that, we use the BufferWithCount extension method from Rx to produce a new IObservable<IList<XElement>>. At this point I hope you picked up on two things: 1) I haven’t called loader.Start() yet, so nothing is happening. 2) loader is an instance of IObservable<XElement> but calling BufferWithCount produces an IObservable<IList<XElement>>, meaning that it transforms messages from the origin into lists of messages of the requested size.
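
If the idea of buffering a stream by count is new, a hand-rolled version may help. This is a sketch of the concept only, not how Rx implements BufferWithCount; the CountBuffer class and its callback parameter are hypothetical. It collects pushed items into a list and hands the list on once it reaches the requested size, flushing any partial batch when the source completes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the buffering behaviour described above; not the Rx implementation.
public class CountBuffer<T> : IObserver<T>
{
    private readonly int size;
    private readonly Action<IList<T>> onBatch;
    private List<T> current = new List<T>();

    public CountBuffer(int size, Action<IList<T>> onBatch)
    {
        if (size < 1) throw new ArgumentOutOfRangeException("size");
        this.size = size;
        this.onBatch = onBatch;
    }

    // Collect items until the batch is full, then hand it on.
    public void OnNext(T value)
    {
        current.Add(value);
        if (current.Count == size) Flush();
    }

    // The source is done: pass along whatever is left over.
    public void OnCompleted()
    {
        if (current.Count > 0) Flush();
    }

    // In this sketch a failure simply drops the partial batch.
    public void OnError(Exception error)
    {
        current = new List<T>();
    }

    private void Flush()
    {
        var batch = current;
        current = new List<T>();
        onBatch(batch);
    }
}
```

Feeding twelve items through a CountBuffer of size five produces batches of five, five and two. Nothing is emitted until items are pushed, which mirrors the point above that nothing happens before loader.Start() is called.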

It’s about to get more awesomer. Now that I have an observable that will produce messages containing a list of 5k messages, I want a way to asynchronously queue transforming and persisting these. Calling Select against the batches observable lets me kick off an asynchronous call to the SaveChunk method (via the delegate defined earlier). This produces a new IObservable<IAsyncResult>, so now we have an observable stream of asynchronous results. The usefulness may not seem readily apparent, but remember BufferWithCount? I can use that same call to batch IAsyncResults, then subscribe to each batch of five, get the wait handles and block until all calls in the batch have completed.
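
The throttling idea (start a handful of saves, then block until that group finishes before starting more) does not depend on Rx or on delegate BeginInvoke, which as far as I know is not supported on newer .NET runtimes. Here is a rough equivalent using tasks. ThrottledSaver and its parameters are names I made up for this sketch, and the save callback stands in for something like SaveChunk.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper: the same "batches of five" throttling, expressed with tasks.
public static class ThrottledSaver
{
    // Starts one task per batch, but blocks every time maxInFlight tasks
    // have been started until that whole group has finished.
    public static int SaveAll<T>(IEnumerable<IList<T>> batches, Action<IList<T>> save, int maxInFlight)
    {
        var inFlight = new List<Task>();
        var started = 0;

        foreach (var batch in batches)
        {
            var captured = batch; // local copy so each task sees its own batch
            inFlight.Add(Task.Factory.StartNew(() => save(captured)));
            started++;

            if (inFlight.Count == maxInFlight)
            {
                Task.WaitAll(inFlight.ToArray());
                inFlight.Clear();
            }
        }

        // Wait for the final, possibly smaller, group.
        Task.WaitAll(inFlight.ToArray());
        return started;
    }
}
```

Because the loop blocks while a group is outstanding, the producer cannot run ahead of the saves, which is what keeps memory flat. A smarter scheduler would start a new save as soon as any one finishes; waiting on the whole group is simply the closest match to the approach described above.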

Once I have created the IObservables and set everything up, I then tell my loader to start. Everything up to that call is wiring up and defining how I want to handle the XElements as they’re produced. Since I don’t want any XElements getting lost, I don’t start the message pump until everything’s in place.

The Result

Running all this on my local development (moderate Core 2 Duo) laptop yields 100,000 inserts per minute for 15 minutes. Memory utilization is hardly noticeable; this process is actually CPU bound because the transform from XML to class type is fairly expensive.

To put these metrics into some perspective: the highest tweet-per-minute average this month (88k) could be imported real-time into CouchDB via Relax without any special tuning or hardware.


Jun 10 2010

A Peek Inside My Brain

While my recent work on Symbiote and Relax probably appears to be all over the place, there is a unifying, underlying purpose behind all the work I’m doing. This post is about my short and long term goals. It’s about the technologies and architectures I believe are going to become important in the not-too-distant-yet-not-immediate future.

Who I’m Building For

I’m primarily building tools for our development team at work. We have seven developers (myself included), who work on multiple projects for internal and external customers. We target SMBs, usually via SaaS solutions.

I’m hopeful that more developers on similar teams with similar needs will find that the Symbiote libraries provide a simple, easy way to adopt some of the great open source frameworks available.

Tenets

If I had a technical manifesto, it would be ridiculously opinionated, long and, in printed form, might be used to even out furniture with wobbly legs. It would also talk a lot about the following tenets:

1. Open standards are the way forward. Proprietary is bad.
2. Distributed architectures will be the best way to take advantage of the new advances in hardware.
3. Open source alternatives to bloated, closed technologies will become vital to small/medium development shops.
4. “Teh Cloud” will become a great place for solutions which aren’t built from proprietary pieces.
5. APIs that are “discoverable”, provide extensibility via dependency injection, and are built around conventions while still allowing configuration provide the most value. (shameless plug for Symbiote)

Technologies and Architectures I <3

Symbiote isn’t a complete list, but it’s a good start of the architectures and technologies I’m excited about. RabbitMQ, CouchDB and Lucene are three technologies I love. CQRS, messaging, and RESTful are just a few of the architectures that can produce powerful and agile solutions.

Technologies I Avoid

Knowing when a technology is bad for you is a skill that I’ve learned the hard way. It’s been painful. It’s been costly. My poor coworkers are probably developing mental defense mechanisms due to my tendency to wax bitter about certain technologies that are infamous for burning projects and small teams to the ground.

Bloated, proprietary, and closed systems are a bad fit for most projects I’ve ever worked on. If you’re going to use technologies built for Fortune 500 IT organizations, you need to BE a Fortune 500 IT organization. These kinds of technologies are career paths unto themselves. If you’re a smaller shop, you generally can’t dedicate entire people to these things alone.

What I’m Focused On Now

Our team is investing a good bit of time developing technologies around CouchDB. Why? It performs well. Schema-less storage is a huge time-savings for us because it doesn’t require a ton of up-front design and it ‘evolves’ gracefully with our domain model (at least so far). The team is going to need to address certain things that CouchDB doesn’t do out of the box. Those things are:

1. Handling relationships between document types
2. Open search capabilities. Writing views for everything a user may want to search on isn’t practical.
3. Reporting capabilities. Josh Bush and Jim Cowart are doing a brilliant job in this area.

That said, I’m trying to fast track a Relax RC that provides indexing and query services using Lucene. I have a proof of concept for those services and I’m also working on a LINQ provider.

Where To Get More Information

Following me on this blog or on Twitter is a good start. If you’re feeling adventurous and want to play with the code, check out http://github.com/arobson and see the Symbiote and Relax repositories. There is currently a wiki at http://sharplearningcurve.com/wiki and I’m also (slowly) working on a site dedicated to Relax documentation, features and updates.


Apr 13 2010

Symbiote Daemon Screencast

Category: Open Source | Symbiote | Screencast
AlexRobson @ 19:15

I’m hopeful that screencasts will provide a quick way for developers to see how quick and painless it is to use Symbiote.

In this screen cast, I demonstrate how to create Windows Services using Symbiote Daemon.

Symbiote Daemon Screencast

 

Since this is my first screencast, it’s probably rough around the edges. As usual, I’m interested in constructive feedback : )

[Edit]

A Little More About TopShelf

TopShelf is a part of the MassTransit library created by Dru Sellers and Chris Patterson. MassTransit is a “lean service bus implementation for building loosely coupled applications using the .Net Framework”. They separated TopShelf out from MassTransit and thus made this excellent piece of software available to folks like me who wanted an easier way to build Windows Services.

TopShelf allows you to create Windows Services from a console application by hosting your service classes for you and by providing command line arguments to trivialize testing, installing and uninstalling your TopShelf based services. See the TopShelf website for documentation on the command line arguments.

Stay Tuned…

Here are a few of the planned Symbiote screencasts to look for in the near future (I’ll turn these into links as they become available):

  • Fluent Log4Net
  • CouchDB
  • RabbitMQ
  • MVC and the Spark View Engine
  • Web Socket server
  • SocketMQ – a Web Socket to RabbitMQ bridge

 

Other Links

Introducing Symbiote - http://sharplearningcurve.com/blog/post/2010/04/12/Symbiote-e28093-Reducing-The-Radius-Of-Comprehension.aspx

The Symbiote wiki – http://sharplearningcurve.com/wiki

Get it at GitHub – http://github.com/arobson/Symbiote


Apr 12 2010

Symbiote – Reducing The Radius Of Comprehension

Category: Open Source | Tools | Symbiote
AlexRobson @ 05:49

Recently, Mike Taylor wrote about his concerns over the proliferation of frameworks and libraries in his Pragmatic Bookshelf article, Tangled Up In Tools. If I understand his point correctly, he’s concerned that developers are too quick to adopt a library or framework which may or may not actually be a good fit and could actually cost the team time and functionality that they wouldn’t have lost had they just built what they needed. At the end of his post, he introduces the term ‘radius of comprehension’ which he defines as such:

“Radius of comprehension” is a new term that I am introducing here, because it describes an important concept that I don’t think there is a name for. It is a property of a codebase defined as follows: if you are looking at a given fragment of code, how far away from that bit of the code do you need to have in your mind at that time in order to understand the fragment at hand?

That’s an excellent concept (thanks Mike!). I think it’s one of those things that many programmers will read and think, “Exactly! I just didn’t know what to call it!”. Sure, we’ve had terms like “usability”, “intuitive” or “clean” but those terms are very subjective. Radius of comprehension is objective: you read the definition and you realize that it gives you a better construct for evaluating an API.

Introducing Symbiote

I said recently that I have been working for four months on a new open source set of libraries. Collectively, I think of it as the Symbiote framework. I feel like it’s a fairly unique attempt at taking a lot of different application concerns and technologies and reducing the radius of comprehension required to use these things in your application. It’s not a framework in the sense of, “Just plug in Symbiote and watch it handle everything”, rather it’s a way to centralize the configuration, dependency injection and simplify (or provide) APIs to some of the better open source libraries available.

Symbiote is more about providing useful APIs than enforcing a design methodology or approach. Furthermore, because *everything* in Symbiote uses a DI system and design by contract, you can replace pretty much anything you like with your own implementations. Symbiote isn’t designed to tie you down. It’s meant to provide value via utility.

Uh, Did You Actually Read Mike’s Article?

Yes, I did. The thing is that those of us employed as application/solution developers with business customers (internal or external) know that there’s a significant rift between the ideal circumstances in which to develop software and the realities we face each day which impede the development of software which provides value, meets needs and generally makes everyone happy.

My point here being, you need good tools. How else are you going to avoid technical debt while focusing on the features and aspects of the software which are perceived by your customers as providing value? Either you have a huge team, you take longer to release or you make technical sacrifices which will hurt the longevity/suitability of the solution. It’s that cut and dry.

Simple, Focused APIs For Open Source Libraries

The name is meant to indicate the overall architecture of the entire framework. The framework consists of several focused libraries which take a dependency on the Symbiote.Core project. The idea is that you only add a reference to one of the other Symbiote libraries to gain a specific piece of functionality.

Where It’s At

Symbiote is under heavy development. That means APIs are subject to change. While I’ve managed to isolate a lot of the changes behind the public facing interfaces, you should still expect some libraries to see more change than others until they stabilize. I’m also working very hard to improve test coverage and provide documentation. You can peek at the bits out on github: git://github.com/arobson/Symbiote.git and you can check out what’s on the wiki so far at http://sharplearningcurve.com/wiki. The source includes some demo applications which really just help to serve as integration tests as well as a way to introduce the functionality provided.

What’s To Come

I plan on trying to blog fairly frequently about Symbiote libraries and functionality. There’s a lot to Symbiote so it’s going to take a while to cover everything it can do. If you’re impatient or really curious, I recommend pulling the bits down and playing with what’s there. I plan to blog soon about the CouchDB and

As always, I’m interested in feedback. You can e-mail me at asrobson AT gmail DOT com or follow me on Twitter, A_Robson.


Apr 8 2010

To All The Frameworks I’ve Started And Abandoned

Category: Open Source | Architecture
AlexRobson @ 07:56

For years now I’ve been trying to create an open source framework that would address many common application development concerns while reducing the amount of time required to get a project started and on its way. My professional experience has taught me that development teams are almost never given enough time and are constantly having to cut corners, incur a lot of technical debt, and choose technologies which aren’t a good fit simply because they don’t have time to incorporate anything new.

While I’ve learned a lot throughout the process and written some interesting code, my efforts have largely resulted in generous shipments of fail. The good news? I’ve tried very hard to learn from those failures and make them count.

Nvigorate

The first project I started was called Nvigorate. It centered largely around an ORM. Yep. I’m that kind of crazy. Thanks largely to people like Craig Israel, the ORM actually got to a really decent feature set. The primary issue was that 2 or 3 developers can’t rival the existing code base, time and team around NHibernate. Not to mention that NHibernate has nearly become the standard in shops where open source is an option. I think I gave up right around the time that I learned that NHibernate also had a nice fluent mapping API and a LINQ API and could support certain inheritance scenarios that I knew would take a very long time to work into my code base. I learned a lot of valuable lessons trying to make Nvigorate into something viable. Here are just a few:

1. Frameworks that don’t make use of dependency injection are garbage
2. Beware internal dependencies on other pieces of the framework
3. Create focused, decoupled libraries where possible
4. Lack of unit test coverage can kill a project dead

BourneFramework

This wasn’t really my framework but I contributed a lot to it. Basically, I contributed way too much to it and realized that I was trying to do a lot more than was originally intended. Bourne was initially intended to bring NHibernate integration into the different tiers: mvc, wcf, services, etc. I made the mistake of trying to make it do way too much and as a result the configuration aspect of the API became big, awkward, fragmented and difficult to explain or use. So from this I learned:

1. Don’t try to make a framework do more than it was initially designed to do, it won’t make “sense”
2. Make the configuration and setup as simple as you can
3. One way to configure an API is best.

This leads me to the new open source framework I started. I think it’s got promise. I’ve been working on it for about 4 months now and we’ve been using pieces of it at work. From the feedback I’ve received so far, I think it’s actually easy to learn and easy to use and provides value. I’ll be talking about it more on the blog very soon.


Jan 22 2010

My Crash Course In High Performance NHibernate

It’s never good when your boss appears in your office unexpectedly to tell you that the deadline you thought was a few days out is actually tomorrow. It’s also not good when it happens right after your analyst informs you that the system you thought was producing valid output was actually built on an oversimplification that was only just discovered. It’s especially bad when the model you’re working against is supposed to be crawling a payroll system with insufficient metadata to support the business rules. This particular model is very complex. So complex that there are professionals who dedicate their entire career just to understanding this single facet of their industry.

Welcome to my hell, circa yesterday morning. The problem is that the process I wrote to handle all this in the first place was already written under a relatively aggressive deadline. This is my preface for telling you that I wrote a crappy console app to “get-er done!”. The issue is that the sheer volume of data, coupled with the awful schema we inherited and the complex business rules and model, made for a very slow load of the better part of the database into memory, so that my code could handle all the calculations and create the new structures, which would then be saved back to a newer (still fairly complex) schema in the database. This wondrous and unnatural process took anywhere from 1.5 to 2 hours to complete. Still, as of last Friday, we thought we were in great shape…

The real issue with a long running process like this is that when a problem is identified, you have to identify the root cause, adapt the model/logic, test, then complete a full run. When there’s a 2 hour overhead in that process, it gets really, really painful. Now I wasn’t just on the hook for this one thing, so it’s not like I’d been able to give this my full attention. I ignorantly thought “this is good enough for now…”

I’m always saying what a good team we have here. Evan Hoff and Jim Cowart really helped me a lot. In one 18-hour day we managed to turn this slow crappy process into a fast crappy process (about 4 to 5 times faster). I also have to give credit to Oren Eini for making the wonderful NHibernate Profiler, a tool no dev wishing to remain sane should be without. Anyway, here’s what I learned:

The NHibernate.Linq Library Is Dangerous
You should only use it for fun time. The eager loading does not work correctly. In situations where you don’t care about lazy loading additional child collections, it’s worked just fine for me. I actually still use it for those cases because it’s type-safe and compile time checked for typos : )

You Can Die From Lazy Loading
Lazy loading ain’t free. It doesn’t seem like it would be a huge deal, but when you have a model that’s more than 2 levels deep with more than just one or two nodes off each aggregate root, lazy loading will kill you dead.

Use The Future Query API To Eager Load
This is awesome. Fortunately Evan had just read Oren’s latest blog entry on this. With some HQL experimentation we figured out how incredibly powerful this is. Sadly, HQL is just a flipping string so it’s easy to mess up. The NH error messages were good enough to point me in the right direction. Read Oren’s post here and the HQL chapter here.

Second Level Caching Is Not Your Friend For High Volume
This wasn’t what I expected but sure enough, turning off the second level cache made the writes back to the database go much, much faster. Calling flush on the session was taking seconds just for a few persists until we eliminated the second level cache.

Use The Reflection Optimizer For High Volume
There is some up-front penalty here but it did help performance. If you’re using Fluent NH like I am, it’s a simple .UseReflectionOptimizer() call during the fluent database configuration step.

You Need One Session Per Thread And Objects Cannot Be Shared Across Sessions
To get this monstrosity running faster we needed to make all this processing happen concurrently. Unfortunately, this process was very complex in how the new object model was created. Certain objects needed to be created and shared across models on an as needed basis. Before parallelizing it, I was able to store these shared objects in a hash and wrap access to them in a nice little function call that abstracted away the fact that I was creating them if they didn’t exist and retrieving them if they did.

This does not work when you’re spinning up threads with a session per thread (this is required) because as soon as you try to associate the shared instance across more than one session, NH breaks. Here’s how we got around this limitation:

Implement a double-checked lock pattern so that you have a dictionary of locks per shared object id and a lock that protects access to that dictionary. When the consumer asks for a specific shared object by id, you check the database first. If the object wasn’t there, then you lock on the outer dictionary lock and check to see if a lock exists for that shared object id. If it doesn’t, you create a lock and store it by the requested object id. After that, you lock on the lock object for that id and check the database again. If there is still no record, you create the object, save it and exit the lock. If it was in the database, you simply return it. Here’s some demo code to reinforce that messy explanation:

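using System.Collections.Generic;
using System.Linq;
using NHibernate;
using NHibernate.Linq;

// Guards access to the dictionary of per-id locks.
private readonly object _dictionaryLock = new object();
// One lock object per shared object id.
private readonly Dictionary<int, object> _sharedObjectLock = new Dictionary<int, object>();

// Returns true (and sets instance) when the shared object already exists in the database.
public bool GetSharedInstanceFromDB(ISession session, int id, out SharedObject instance)
{
    instance = session.Linq<SharedObject>().FirstOrDefault(x => x.Id == id);
    return instance != null;
}

public SharedObject GetSharedInstance(ISession session, int id)
{
    SharedObject instance = null;
    // First check: no locking needed if the object is already in the database.
    if (!GetSharedInstanceFromDB(session, id, out instance))
    {
        object idLock;
        lock (_dictionaryLock)
        {
            // Fetch (or create) the per-id lock while holding the dictionary lock;
            // Dictionary is not safe for concurrent reads and writes.
            if (!_sharedObjectLock.TryGetValue(id, out idLock))
            {
                idLock = new object();
                _sharedObjectLock.Add(id, idLock);
            }
        }
        lock (idLock)
        {
            // Second check: another thread may have created it while we waited.
            if (!GetSharedInstanceFromDB(session, id, out instance))
            {
                // code to create instance
                session.Save(instance);
                session.Flush();
            }
        }
    }
    return instance;
}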

Far from simple, but for us, unfortunately, it was necessary. The nice thing about this is that it gives you a way to multi-thread session access and still share a common object between threads without causing session collisions.
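If you want to poke at the lock-per-id idea without NHibernate or a database in the way, here is a minimal sketch of the same pattern. Everything in it is an assumption for illustration: FakeStore is a hypothetical in-memory stand-in for the table, SharedInstanceProvider is a made-up name, and the "shared object" is just a string.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the database table; not part of NHibernate.
public class FakeStore
{
    private readonly object _sync = new object();
    private readonly Dictionary<int, string> _rows = new Dictionary<int, string>();

    // Counts how many times a row was actually created.
    public int Inserts;

    public bool TryGet(int id, out string value)
    {
        lock (_sync) { return _rows.TryGetValue(id, out value); }
    }

    public void Insert(int id, string value)
    {
        lock (_sync) { _rows.Add(id, value); Inserts++; }
    }
}

public class SharedInstanceProvider
{
    private readonly object _dictionaryLock = new object();
    private readonly Dictionary<int, object> _locks = new Dictionary<int, object>();
    private readonly FakeStore _store;

    public SharedInstanceProvider(FakeStore store) { _store = store; }

    public string GetOrCreate(int id)
    {
        string value;
        // First check: the common case, no per-id lock taken.
        if (_store.TryGet(id, out value)) return value;

        // Find or create the lock object for this id.
        object idLock;
        lock (_dictionaryLock)
        {
            if (!_locks.TryGetValue(id, out idLock))
            {
                idLock = new object();
                _locks.Add(id, idLock);
            }
        }

        lock (idLock)
        {
            // Second check: only the first thread through creates the row.
            if (!_store.TryGet(id, out value))
            {
                value = "shared-" + id;
                _store.Insert(id, value);
            }
        }
        return value;
    }
}
```

Calling GetOrCreate with the same id from several threads should produce exactly one insert, while threads asking for different ids only contend briefly on the dictionary lock.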

DO NOT USE IDENTITY COLUMNS! AHHHHHHHHH
We used identity columns : \ I’ve pretty much always been against them because I don’t like the idea of my database telling me what the identifiers for my records are. I like to have control (does that make me crazy?). NH pros will tell you to use Hi-Lo or something like that which allows your clients to create unique, yet arbitrary ids for your tables. Why does it matter?

Well, unlike my now dead ORM, NHibernate does not attempt to write your FK values from parent objects in one go. Instead it will do a follow-up Update to all the child rows to provide the database-specified parent Id when you’re using identity columns. This can get very expensive and chatty, very, very quickly. On the other hand, if you’re specifying the id in your client code, it’s already available to the child FK rows. IGNORE THIS ADVICE AT YOUR OWN FLIPPING PERIL. Sadly, we can’t just change all the schema and models at the last minute, but it’s definitely something I will take with me moving forward.
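To make the Hi-Lo idea concrete, here is a rough sketch of how a client-side generator can hand out ids without a database round trip per row. This is a simplification for illustration and not NHibernate's actual implementation: HiLoGenerator, the nextHi callback and the maxLo block size are all hypothetical names, and the callback stands in for the single query that reserves the next block.

```csharp
using System;

public class HiLoGenerator
{
    // Hypothetical callback: one round trip that reserves the next "hi" value.
    private readonly Func<long> _nextHi;
    // How many ids can be handed out per reserved block.
    private readonly int _maxLo;
    private readonly object _sync = new object();
    private long _hi;
    private int _lo;

    public HiLoGenerator(Func<long> nextHi, int maxLo)
    {
        _nextHi = nextHi;
        _maxLo = maxLo;
        _lo = maxLo; // forces a block reservation on the first call
    }

    public long Next()
    {
        lock (_sync)
        {
            if (_lo >= _maxLo)
            {
                // Block exhausted: reserve a new one.
                _hi = _nextHi();
                _lo = 0;
            }
            // The id is known on the client before anything is inserted.
            return _hi * _maxLo + _lo++;
        }
    }
}
```

Because the parent's id exists before the insert, child rows can carry the FK value in the same statement, which is the whole point of the rant above.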

 

And that’s all I have to say about that. Hope it’s helpful : )

 

 


Jan 19 2010

The Wrong Tools

Category: Open Source | Architecture | Tools
AlexRobson @ 13:32

Have you ever heard some truism or principle, immediately thought, “Exactly! It’s so simple and obvious!” then looked around at your colleagues and exchanged some knowing laughter? Maybe you even made fun of the poor bastards who didn’t get it? “Huh-huh-huh, yeah, like this guy, Durfin, he’ll NEVER GET IT!”

Well, whenever I used to hear things like “don’t use a hammer to drive screws”, “use the right tool for the job”, etc., I always reacted like that. Like a turd. Never bothering to really reflect upon whether or not my blind application of all things Microsoft was right. I am moving the other direction now and trying to really examine my assumptions about what’s necessary vs. what’s common. Instead of just being all abstract and gushy though, I’m going to be more specific and go over common assumptions that I and others in Microsoft country have been stumbling over for quite some time now.

If You Need RPC, WCF Is Great!

In my experience, WCF is great if you enjoy spending a lot of time on high-learning-curve, low-return, frustrating, bloat-ware. I won’t revisit all the magical ways WCF can bite you, but I will say that WCF users generally fall into three groups:

1. Geniuses. These folks have no idea why anyone would complain about WCF.
2. The Frustrated. That includes people like me.
3. The Clueless. These people think they’re in the first group. They’re using WCF in over-simplistic, “look, ma, I made codez!”, demo-quality applications which create more jobs for people in the first and second groups when it all goes to hell.

My point is that WCF is solving a problem that doesn’t really exist outside of the Microsoft ecosystem. With the right tools, you can accomplish the same end-goal in much simpler and more elegant ways:

Erlang – has the ability to communicate natively with distributed Erlang nodes built into the language itself. It’s accomplished via a trivial one liner that takes the form: <address> ! <message>. Simple. Elegant. Powerful.

AMQP and XMPP are open protocols which relate specifically to messaging. Lots of stuff uses these two. Google’s Wave is built on an extension to XMPP. My favorite AMQP implementation is RabbitMQ (written in Erlang).

Takeaway here: WCF is overkill. It’s a jackhammer for jeweler’s screws.

If You Need To Store Something, Microsoft SQL Is King!

Microsoft SQL is a good RDBMS. Hard to argue that. What are relational stores good for? Analytical processing of data. Most applications I’ve worked on actually need analytics, but most of the time is spent figuring out how to make my application model save its state in a schema that’s well suited to analytics.

The NoSQL movement is good and bad. The problem is too many reactionaries (myself included) hear NoSQL and think “YAY, NO MORE RDBMS!!!” That’s actually a terrible idea for many business line applications or anything you need analytics for. If you want reporting, you’re going to find that writing reports against a document store is probably just as torturous as mapping a model to normalized tables.

Use the right tool for the right job. Use the document databases for your transactional store and use relational databases for your analytics. You can do both. (I’m not going to go into that right now). The important thing is realizing that it’s not all or nothing in one approach.

SOA Is The One True Architecture

SOA gets overused so much that it means different things to different people. Many developers, formerly myself, believed that SOA meant lots of atomic services which performed certain tasks and that combining these services was how you built out an application. For those of you who don’t have infinite time and resources to maintain your application, may I suggest that this implementation of SOA is really just 3-tiered architecture which rarely scales out well and is fairly difficult to architect, develop and maintain.

Instead, I would point you to Gregor Hohpe, Udi Dahan and others who would suggest SOA is something very different. Messaging-based architectures allow you to add services which look more like nodes on a bus. Each node performs a specific role and the beauty here is that the things putting messages on the bus and the things reacting to messages coming off the bus don’t have to know *anything* about each other. They only need to know about the message types and the bus. Ta-dow.
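The "only know the message types and the bus" idea can be sketched in a few lines. This is a toy, single-process, synchronous bus made up for illustration (InMemoryBus and OrderPlaced are hypothetical names); a real bus would be asynchronous, networked and durable, and this sketch is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public class InMemoryBus
{
    // Handlers are keyed by message type; nothing here knows who publishes or who listens.
    private readonly Dictionary<Type, List<Action<object>>> _handlers =
        new Dictionary<Type, List<Action<object>>>();

    public void Subscribe<T>(Action<T> handler)
    {
        List<Action<object>> list;
        if (!_handlers.TryGetValue(typeof(T), out list))
        {
            list = new List<Action<object>>();
            _handlers.Add(typeof(T), list);
        }
        // Wrap the typed handler so it can live in an untyped list.
        list.Add(message => handler((T)message));
    }

    public void Publish<T>(T message)
    {
        List<Action<object>> list;
        if (_handlers.TryGetValue(typeof(T), out list))
        {
            foreach (var handler in list) handler(message);
        }
        // Messages with no subscribers are simply dropped in this sketch.
    }
}

// Hypothetical message type shared by publishers and subscribers.
public class OrderPlaced
{
    public int OrderId;
}
```

A billing node and a shipping node could each subscribe to OrderPlaced without either knowing the other exists, and adding a third node later means touching neither of them.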

How All Of This Relates

I’m still working on that. In my last post I talked about the fallacy of thinking I was at least aware of everything I needed to know about. These are some of the things I thought I knew which were instrumental in leading me down a path of high frustration, long hours and low return.

I’m not saying I currently have everything figured out, but I am saying that looking into other tools and languages has helped me realize that there are far better ways to solve the problems that application developers face daily. I’m blogging about this stuff because it’s exciting to me and I feel like it’s information worth passing around.


Nov 4 2009

Chasing The CI Grail - ESXi, Debian and Git

Category: Open Source | Tools
AlexRobson @ 07:00

As I said in the intro, this series of posts is all about me trying to find a solution that I like for continuous integration. It’s about the search. It’s about the learning process. It’s about seeing how many times I run into a wall before getting out of the maze or discovering there’s no end. The title is a dead give-away but in case you missed it, my journey begins with trying out git for source control.

Goal
Last week I decided it was time to actually look into what it would take to get an internal git solution together. If you’re new to this blog, you might not know that I love git. Well, I do. What I didn’t love were my options for running it on Windows. I don’t mind the idea of developers making their root git directories shared amongst the Active Directory Developer’s group. I actually think that’s an acceptable way to provide read-only access so that others can pull/clone/fetch/etc from their peers. But for my primary CI build server, I want a separate, more tightly controlled git repository so I know when and how the source gets there.

Git
Git has 3 ‘bits’ to it: git (core), git-daemon and git-web. Each of these relies on POSIX. That means to run them in Windows you’d have to try something like Cygwin. Now, I’m not knocking Cygwin, but it sounded too much to me like I was trying to force something unnatural. Why not just run git the way holy Torvalds intended on an OS of his own grand design?

Git core is what you’ll find the most documentation about because by itself, it’s a complete source control system. Git-daemon is a service that you can use to natively serve up git repositories to the outside world. It’s really not ideal for anyone doing development on Windows boxes though because it expects that anyone interacting with it will have accounts on the box/workgroup/domain that the host OS knows about. Git-web is a terrific web front end for exploring git repositories that runs on Apache. I can’t really say enough nice things about it because I’ve literally never seen anything integrate that easily and perform so well. My favorite feature so far? You can grep a repository’s files from the web interface and it’s lightning fast.

Where To Get It
I’m using Debian, so I apt-get installed my way there. It’s a simple, clean install experience and you don’t even have to recompile the kernel. If you’re curious about using git on Windows, I can’t recommend msysgit highly enough.

Debian
Debian was actually not my first choice for OS. I was going to try several different distributions that I had experience with, starting with OpenSUSE, but in the years that have passed since I last tried it they’ve made some changes that made it difficult to get git set up according to some of the useful guides I’ve found.

After several disappointing hours of fiddling with OpenSUSE, I decided to try Debian based on the recommendations of Dave Purdon and Evan Hoff. I have to say, I really like Debian. It’s clean, simple and everything’s where I expect it to be (and where most git-related walk-throughs tell you they’ll be).

Where To Get It
The best way to get your hands on it is using microTorrent and the torrent(s) for Debian. CD disc 1 is sufficient to get your install started although you will end up pulling a few things down from a mirror of your choice to complete the install.

ESXi
I never questioned that I’d be running these servers as virtual machines. And even though I have experience with ESX and how awesome it is, I don’t currently have access to the company ESX server. So to get started I was considering Workstation or VirtualBox or something, and Evan Hoff recommended I look into ESXi, VMware’s free-esque ESX package. If you’re unfamiliar with either, ESX[i] is a Type 1 hypervisor, meaning it runs on bare metal. This means no stinky host OS to gobble up your precious resources. ESXi requires you only use it on boxes with 1 processor (up to 6 cores) amongst other things, so I re-provisioned my desktop as my ESXi server. It’s down-right dreamy. ESXi is missing some of the more spectacular and magical things that its commercial big brother can do, but heck; it’s FREE.

Where To Get It
ESXi is available here and you do need to register for it if you would like to continue using it after 60 days. The vSphere client software can actually be downloaded from a URL on your ESXi server once it’s up and running.

Important Note:
If you happen to try it out for yourself, beware that you will have to hack about with the vSphere client to get it to run on Windows 7 (read about how to do that here).

Next Steps
So, shortly after getting ESXi installed, I spun up a VM for Debian, got Debian installed, apt-get installed git, git-daemon and git-web and then promptly ran into a wall. Next post I’ll talk about getting past the roadblock and go into a lot more step-by-step detail (as well as provide more links and resources).


Oct 22 2009

The Bourne Framework – A High Level Introduction

Category: .Net Framework | Open Source
AlexRobson @ 03:16

For a little over a month now, I’ve been contributing to an open source project started and architected by Evan Hoff. After the first week of the project, I started bugging him about when I could blog about it. The project is called the Bourne Framework, and it’s changing the way I write code*. The one thing I should make abundantly clear is that Bourne is new and subject to change. The good news is that it’s also very usable in its current state.

Technology Stack
A lot of the framework comes from Evan’s past experience with several open source projects. Most of them are fairly widely known:

  • NHibernate
  • Fluent NHibernate
  • NHibernate.Linq
  • StructureMap
  • MassTransit
  • TopShelf (part of MassTransit)
  • log4net

The framework uses these open source libraries in order to provide out-of-the-box infrastructure for the following types of .Net applications:

  • ASP.Net MVC
  • WCF
  • Windows Services

What It Does
It’s difficult to summarize without just throwing around buzz-words. I would say that it does an excellent job of tying together leading open source frameworks by providing an integrated infrastructure for configuration and application of these libraries. I feel that’s particularly valuable to developers who want to use the best technologies available but don’t necessarily have the time and/or resources available to do deep dives on all of them. It doesn’t and can’t completely abstract everything away. In fact, anyone who has experience in these open source frameworks knows that you’ll still need to understand what they do and how to use them in general. The difference between using them on your own and using Bourne is that Bourne gives you a really nice structure and simplifies the configuration experience while reducing LoC, something that I’m a big fan of.

How Do You Learn It?
Bourne has a fairly respectable set of unit tests as well as some demo code included in the source. That’s a good place to start. I am going to start a blog series where I go through several different (very simple) types of applications which show off some of Bourne’s features. I also plan to make the source for all of these demos available on GitHub.

Where Can You Get It?
Bourne Framework is on GitHub. Evan’s url is git://github.com/therealhoff/BourneFramework.git and mine is at git://github.com/arobson/BourneFramework. Evan hasn’t had time to review all the code I’ve added to it, so if you want the purest Bourne, check out his repository first.

 

*Which, if you’ve seen my code, is a really good thing : )
