Tuesday, July 14, 2009

Where do we get our data for comparison?

In case anyone missed it, we have been developing a flight-comparison site for the last three months, so let's do some comparing. Hold on, there is something missing: a comparison requires something to compare, and that is where our suppliers come in. Suppliers in this case can be airlines or travel agencies; in short, anyone who can deliver flight data and has a site where you can book the tickets.

In a simplified and uncached form a search goes something like this:
  1. The user starts a search on the Openjet site
  2. The Openjet application tells the TripTrap meta-engine the details of the search
  3. TripTrap translates the request for each supplier and sends it to them in their format
  4. The suppliers return data to TripTrap which translates it to a unified format
  5. The Openjet site fetches the unified data and does its magic
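
In code, steps 3 and 4 suggest a per-supplier adapter behind a common interface. Here is a minimal sketch of that shape in Java; all the names are illustrative, not the actual TripTrap API:

    import java.io.IOException;
    import java.util.List;

    // Hypothetical sketch of the per-supplier adapter idea (steps 3 and 4);
    // none of these names are the real TripTrap API.
    interface SupplierConnector {
        // Translate the unified request into the supplier's own format, query
        // the supplier, and normalize whatever comes back to the unified form.
        List<UnifiedFlight> search(UnifiedSearchRequest request) throws IOException;
    }

    // Minimal placeholder types so the sketch is self-contained.
    class UnifiedSearchRequest { String origin; String destination; String date; }
    class UnifiedFlight { String supplier; String flightNumber; double price; }
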
So for this to work we have to have contracts with suppliers and some sort of information from them. In my earlier work I dealt a lot with suppliers for our old meta-engine, and during that time I spent a while thinking about what we will need from them and what they will need from us. Two important areas come to mind immediately.

First comes the issue of technical documentation. In some cases this is quite good and comes with clear examples and instructions on how to use a nice web service. In other cases the result may need to be retrieved in some obscure JavaScript-encoded format, directly from a web page, with no documentation whatsoever. To be able to develop new supplier connections efficiently, this is something to think about. Would it be feasible to say “no” to suppliers without good documentation? Can we afford to pass up a good supplier just because they do not have documentation?

Of course you could always have a contact person to talk to in case you run into problems when developing the supplier connection. This brings me to the second important area: having a technical contact address that is not tied to a specific person, and that is not liable to change if the structure of the company changes. Too many times in the past I have tried to reach the technical contact for a supplier connection that had stopped working, only to find that no one replied at that address, or that the mail ended up in marketing and got passed around for two weeks before I got a first reply.


There are of course other pieces of information that are important for us and for the suppliers we use, but these two areas are the ones that have touched my work the most in the past, and I feel they are key to an efficient development process and bug handling.

Friday, July 10, 2009

Cluster-wide search

Well, after some refactoring, TripTrap is finally committed to the git repository.

Besides a lot of refactoring, rearranging, fixing of broken tests and optimizing, the main recent improvement is that each TripTrap instance in a cluster environment is now, in some sense, aware of what the other nodes are doing. We had a fundamental problem here: each cluster node could perform the same search, even if it was already running on some other node. That does not happen very often, so it is not a big issue, but such redundancy overloads the system and can cost us money. Nobody likes paying money for exactly nothing.

Some synchronization through the DB could be implemented, but that is not a scalable solution; the DB would become a bottleneck before long. The solution, again, is to use memcached and place a mark there along the lines of "I'm searching this and that for that supplier, account and request", deleting the mark once the search is over. For polling requests this avoids the redundancy entirely. For synchronous requests, which wait for all possible responses, there is a somewhat more complex strategy: if such searches are running on other nodes, launch the local ones first, then check again whether some of the remote searches have completed, and so on. Memcached is fast and scales very well, so I expect no bottlenecks there.
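
To make the mark idea concrete, here is a minimal sketch in Java, assuming the spymemcached client; the key layout and TTL are made up for illustration:

    import net.spy.memcached.MemcachedClient;

    // Sketch of a cluster-wide "search mark" on top of memcached.
    // Key layout and TTL are illustrative, not TripTrap's actual values.
    public class SearchMark {
        private static final int TTL_SECONDS = 300; // stale marks expire on their own
        private final MemcachedClient cache;

        public SearchMark(MemcachedClient cache) {
            this.cache = cache;
        }

        // True if we placed the mark, i.e. no other node runs this search yet.
        // add() succeeds only if the key is still absent, so the check is atomic.
        public boolean tryMark(String supplier, String account, String requestHash)
                throws Exception {
            String key = "search:" + supplier + ":" + account + ":" + requestHash;
            return cache.add(key, TTL_SECONDS, "searching").get();
        }

        // Remove the mark after the search is over.
        public void unmark(String supplier, String account, String requestHash) {
            cache.delete("search:" + supplier + ":" + account + ":" + requestHash);
        }
    }

The TTL acts as a safety net: if a node dies in the middle of a search, its mark simply expires instead of blocking that search forever.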

Now there are two different implementations of search, configured in the Spring context: a cluster-aware one, which uses this cluster-wide synchronization, and a single-instance one, which is simpler and faster.

So it goes.

Tuesday, July 7, 2009

Soft numbers

There is an interesting tendency nowadays, especially among economists, to present things in exact figures. As if they were true.

In particular this is, surprisingly enough, almost always the case when it comes to predictions. Future estimations. How can a future estimation ever be exact?

You often hear economists state things like “We expect a growth of 11% during the next 6 months”.

How can he state something like that? What he has in fact done is take some highly uncertain, often estimated, figures. Applied them to one or several economic and/or mathematical models of his choice(!). And finally selected(!) the most suitable output from them.

Now, how can he claim the outcome of that to be true?! It might be a fair estimate at best, but as for the exact number, the only thing we can be sure about is that it is definitely not true. In fact, we can be all but certain we will not hit that exact number.

The same thing applies to how people make their filter selections on most flight-search sites. The figures they input are preferred figures, not exact ones, in the same way the economist is presenting an estimate, not an exact prediction. Still, most systems interpret the user's input as if it were his exact wish.

If a user, for example, sets a filter to “leave after 08:00”, that is in most cases because he would prefer to leave after that time; it is rarely an absolute need. If I were to offer him a 200€ cheaper flight at 07:48, he is very likely to take it.

In OpenJet we are trying to capture this in the design, and our current approach is not to hide the results that most sites would have filtered out, but rather to highlight the ones currently within the preferences of the user.

This makes it possible for the user to have, at all times, an idea of what has been “filtered out” of his listings, helping him decide whether it is worth getting up that one hour earlier or not.

Thursday, July 2, 2009

Some notes on the presentation

Now that the prototype presentation/demo has been done, we have had some time to think about how to proceed with the project. We also took quite a bit of useful information away from the presentation, in the form of comments from the people attending.

Overall I think our demo was a success, and I really feel that we have come pretty far during the short period of time we have been working on this project. Obviously, not everything is working well enough for a production release yet, but the main skeleton is there and working. We are all satisfied that we held this prototype presentation quite early, as it minimizes the risk of doing the "wrong thing". It is very easy to go blind staring at your own code every day, and with this demo we got fresh eyes looking at the application.

One feature that seemed to really catch the attendees' interest is how we look up locations. By using our own database we can supply locations connected with airports very fast, using auto-completion in the input fields. But the really cool thing is that if we don't have the search string in our database, we ask Google via the Maps API. Usually, Google can find what you're looking for (down to a very detailed level, I must say). It then replies with the coordinates of the location, and from those we can fetch the nearby airports.
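
Roughly, the lookup chain works like the sketch below. All the class and method names are hypothetical stand-ins, not our actual code, and the geocoding stub points at Google's HTTP geocoding endpoint:

    import java.util.Collections;
    import java.util.List;

    // Sketch of the lookup chain (all names hypothetical): try our own
    // location table first; only on a miss, geocode the free-text query via
    // Google and then search for airports near the returned coordinates.
    class Location { double lat, lng; }
    class Airport { String code; String name; }

    class LocationLookup {
        List<Airport> lookup(String query) {
            Location loc = findInOwnDb(query);   // fast path: local database
            if (loc == null) {
                loc = geocode(query);            // fallback: ask Google
            }
            if (loc == null) {
                return Collections.emptyList();
            }
            return airportsNear(loc, 100);       // radius in km, illustrative
        }

        Location findInOwnDb(String query) { return null; }  // stub: SQL lookup
        // Stub: HTTP call to something like
        // https://maps.googleapis.com/maps/api/geocode/json?address=<query>&key=<key>
        Location geocode(String query) { return null; }
        // Stub: distance query against the airports table.
        List<Airport> airportsNear(Location at, int radiusKm) {
            return Collections.emptyList();
        }
    }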

It was pretty interesting to demo that feature, as it can find really, really small villages and still get the airports. This means you no longer have to search for an airport, or for a city in which you know there are airports; now you can simply search for where you want to go, and we'll find you the airports.

Another thing that was appreciated by the people attending our presentation was the fact that we search while you filter the results. This means we cut a lot of the waiting time you normally get while doing a search. In many cases, the search results will load in an instant.

But people also had some more sceptical comments. And that is where this gets really interesting. As an example, we have tried to redesign the results page to avoid the very common problem of "losing your results". So what does that mean? Well, it is not uncommon on search results pages to use filters to remove the results that aren't of interest to you. The problem with this, though, is that you usually see your results disappear from the screen when you change your filters, so it is hard to relate what you see on the screen to the alternatives, as the alternatives aren't visible at all. Sometimes it is even worse: the results are filtered in the search itself, returning only a very specific result set. This means you have to perform a new search just to change a price range, or maybe a set of dates.

We tried an alternative approach to this problem, where we display quite a lot of results, based on very broad settings in relation to the user's search. We then use the filters on the client side to filter out irrelevant results. This means that we never have to perform a new search as long as the user doesn't want to change travel locations. Visually, we chose to hide irrelevant results with the filters: result items that fall outside the user's set price range, for example, fold up, leaving only a header with the price visible. The user won't have to see the irrelevant result item itself, just a header indicating that "for this price, there is a flight that might interest you". This allows the user to play with the filters and get a picture of how they might need to change their flight plans to get onto a cheaper flight.
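
The gist of the fold-instead-of-remove idea, sketched in Java for brevity (the actual filtering runs client-side, and all the names here are made up):

    import java.util.List;

    // Sketch of "fold, don't remove": every result stays in the list and the
    // filter only decides how each item is rendered. Names are illustrative.
    class ResultItem {
        double price;
        boolean folded; // folded items render as a one-line price header
    }

    class PriceFilter {
        // Instead of dropping out-of-range items, mark them as folded; the view
        // then shows only the header: "for this price, there is a flight that
        // might interest you".
        static void apply(List<ResultItem> results, double maxPrice) {
            for (ResultItem item : results) {
                item.folded = item.price > maxPrice;
            }
        }
    }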

We felt that we had found a neat solution to a common problem. But we got comments indicating that there might be problems understanding what you see. This way of displaying the results is fairly unconventional, and might not be as easy to grasp as we had first believed. Basically, the comments indicated that it was just too much to take in for a user who is used to a standard top-to-bottom results list.

In cases like this, we are very lucky we held this presentation early. And I mean that for several reasons. We could try out these alternative solutions without extreme risks, as we knew that people would see and comment on them before it was too late to change. This gives a very strong feeling of freedom. By working in small steps and keeping the process very transparent, we can try these things. If it works, it works; if it doesn't, no huge harm done.

There is no real risk of pushing a lemon into production; people will simply let us know before it gets too far. But that is if we get it wrong. This freedom also means that we might come up with something new that turns out to be really good, just because we have the room to take these "risks".

The OpenJet project is in fact a very big project, but we treat it kind of like a smaller one, focusing on the small things one by one. I am sure that we will succeed with this project, as it is not only about us developers; it is also about all the other great people we work closely with. And these are the people who give us invaluable input during our presentations.

Saturday, June 27, 2009

Martin mentioned unique ID generation in the previous post, which reminded me of a problem we solved in a similar fashion. At the beginning of TripTrap development the question "how do we find out if such a request was already processed?" arose. Well, that's quite easy: store it in the DB and next time check if it's already there. What becomes a problem is "how do we do that fast?", especially when we have almost twenty request parameters (and thus as many ANDed conditions in the SQL query), several possible types of requests, parameters that can be omitted, and so on.

The solution was the following: compute a 64-bit integer hash for the request and store the request in the DB and in memcached together with that hash. The stored request is just a binary-serialized and gzipped blob, so in the DB we have both a BLOB field and a BIGINT field. The latter indexes very well, is virtually unique, and makes a very good key for memcached.

For each incoming request we compute the hash, fetch the blob from the cache or the DB, deserialize the blob and compare it with the incoming request in Java code. In the virtually impossible case of two different requests sharing the same hash, the only cost is that we have to fetch, deserialize and compare more than one request blob from the DB, or a list of request blobs from memcached.
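
A minimal sketch of the blob-plus-hash part (the particular 64-bit hash is interchangeable; here the first eight bytes of an MD5 digest stand in):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.zip.GZIPOutputStream;

    // Sketch: serialize+gzip the request into the BLOB, derive the BIGINT key.
    // The choice of 64-bit hash is illustrative (first 8 bytes of MD5 here).
    class RequestHashing {

        static byte[] toBlob(Serializable request) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes));
            out.writeObject(request); // binary-serialized and gzipped
            out.close();
            return bytes.toByteArray();
        }

        static long hash64(byte[] blob) throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(blob);
            return ByteBuffer.wrap(md5).getLong(); // first 8 bytes as the BIGINT key
        }
    }

The hash narrows a lookup down to (almost always) a single candidate row or cache entry; the byte-for-byte comparison of the deserialized request settles the rest.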

Voilà: give memcached plenty of RAM, and the whole thing works about as fast as possible.

Thursday, June 25, 2009

Pseudo-Uniqueness

The need to give things unique names is something that is present in many applications, from variable names to database fields. Openjet is no different and we already have two cases where we need the system to automatically generate unique identifiers.

The first thing we need a unique ID for is a visitor. In every part of the system we want to be able to refer to a certain visitor, to be able to retrieve and store information specifically for that user and track what the user does. Secondly, we are going to store points of interest, which have coordinates, a name, and a locale (language and culture setting). These three fields define the point of interest, and if they match those of an already existing point, the two points are considered equal.

Of course, we could get a unique ID for each item by just inserting them into the corresponding database table and letting MySQL auto-increment the ID. In the case of visitors, though, we do not want someone to be able to guess other people's IDs, so it would have to be more complex than a number that increases by one for each visitor. When it comes to the points of interest we want a globally unique identifier, since it might be compared with IDs from other types of locations. Because of this we cannot rely on MySQL to generate the IDs.

So what did we do? Well, I wrote a utility class that creates hashes from input strings, of course! Using the MD5 algorithm for hashing, we get a 32-digit hexadecimal number which will be virtually unique. Now I can hear some of you think: "Virtually unique? How can you be sure it will not collide with an already existing ID?" Well, the short answer is that we cannot!
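
Before the long answer, here is roughly what such a utility looks like; a sketch, not necessarily the exact class we use. For a point of interest, the input could for example be the concatenation of its coordinates, name and locale:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch of an MD5-based ID utility: input string in, 32 hex digits out.
    class IdGenerator {
        static String md5Hex(String input) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString(); // 128 bits -> 32 hexadecimal digits
        }
    }

Equal points of interest then get the same ID by construction, since their defining fields produce the same input string.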

If you are interested, the long answer now follows.

If I generate a 32-digit hexadecimal number, it consists of 128 bits. 128 bits can be set in 2^128 different combinations, which means that the chance of getting the same hash for another item is 1 in 340282366920938463463374607431768211456 (39 digits). Let's say we have an unrealistically busy site, generating a million new visitors or points of interest a day. Then after a thousand years we will have generated less than 4*10^11 different hashes. This means that we have a chance of 4*10^11 in 2^128 of calculating an already existing hash at this point, which is a probability in the magnitude of 10^-27.

Maybe our site is really successful and continues until no life can exist on earth (8 Gyears from now). Well, I am not going to bore you with the calculations for this one, but at that point we would have generated about 3*10^18 different hashes. This still only gives us a vanishingly small probability, on the order of 10^-20, of hitting one of the hashes we already used.

This is why I say virtually unique. The chance of getting the same ID is so small that it will probably never happen, and if it does the probability is so small that it probably did not. So I now leave it up to you to decide if this is unique enough for you!

Tuesday, June 23, 2009

Research and Development

There are philosophers who claim that the only possible knowledge is direct knowledge: things we have experienced on our own, with our own senses.

I don't share this view of knowledge. I believe we are highly capable of gathering, abstracting and understanding other people's experiences, ideas and failures, and of building upon them.


We are now approaching, at high speed, the big day of this project: the day when we are going to present a first, in some sense working, prototype of the site we are building.

At a very early stage of this project we decided that background research and deep consideration of every major choice would be, if not the most important, at least one of the most important aspects of our working methodology.

Today I'm glad we made that choice, especially since none of us had built this kind of software before. Sure, we all had good experience in science and software development, but not of this particular kind.

So we invested our first two months in intense research, architecture work, brainstorming and planning: setting up tools and looking through pretty much anything that seemed relevant; tools, competitors, libraries, books, frameworks, architectures, standards, etc., etc.

Sure, there were times when at least one of us felt we might be investing too much time in research and preparations, feeling that maybe we should have started coding earlier. “Get something done.”

I think he has changed his mind today. The solid research and knowledge base we built up before we got started has made an incredible development speed possible during our last month of coding: an amazingly bump-free stretch of development that would surely not have been possible without all the time we spent reading and looking through options beforehand.

Today, just three days ahead of the demonstration, we are quite well prepared, giving the final touches to a piece of software far more functional than at least I was expecting us to be able to build in just three months, starting from nothing and with no prior knowledge of the domain.

Of course it's still a very, very limited prototype, and lots of work still remains, but I feel we have done a good job, with Kenny now focusing and working hard these last days to get all the functionality and ideas we have into some presentable visuals.


I believe this project is just on the right path. I believe that if we are just allowed to continue this project, this is the site that will define the next generation of this industry.

On Friday it will be decided whether or not the management shares our beliefs...