Infrastructure, infrastructure, infrastructure!
We've recently released our new 3.0 client for the iPhone (Android is coming soon!) and got really incredible feedback on it. However, some of you told us that our priorities are wrong: instead of a redesigned client or new features, you think we should focus on our infrastructure first, since building new stuff on top of a shaky base does not make much sense.
While infrastructure is a general term that covers different services, we are fully aware that a lot of the things we promised from day one are not working as they should. This is mostly true of our "daily process", which includes points, client tile updates and map editing. In fact, from your perspective they have not improved over the last year or so, despite us telling you that "we are working on it".
The fact is that the majority of our dev team is actually working on our infrastructure, and we have not stopped that work in order to add new features. We have been constantly redesigning our backend servers and deploying improvements to them. So why aren't you seeing any real-world improvements? There are several reasons for that:
- We are growing fast. While a lot of things have not improved, they have not deteriorated either, which means we did improve our capacity, just not fast enough. What we saw is that every time we deployed a newly redesigned backend service, new users would come in and fill the available capacity within just a few days.
- Some improvements are transparent to you. Twice in the last year we had a major downtime of several hours. We rely on AWS to run our services, and when they had issues our service went down. So over the last few months we have been working (and still are) on redeploying all of our core services in a redundant configuration, where each service runs in at least two AWS availability zones. That way, if one zone goes down, our service will keep running in another zone.
- Serialized processes and dependencies. Our "daily process" was historically a serialized process which involved analyzing users' tracks, updating points (partially based on the track analysis), updating tiles, switching over to the new tiles, etc. Since every such task depends on the previous one, an issue in any task delays the whole process. So while we have been improving and optimizing the different parts of the chain, there was always a weak link causing delays.
- Deployment and testing. Deploying infrastructure changes is a slow process as the effect of a bad deployment is pretty much disastrous.
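As a rough illustration of the serialized-process point above, the following Python sketch shows how a delay in one early task pushes back every task after it. The stage names and durations (in hours) are hypothetical and are not taken from the actual daily process.

```python
# Hypothetical stages of a serialized daily process: (name, duration in hours).
# These names and numbers are assumptions made for illustration only.
STAGES = [
    ("analyze_tracks", 3.0),
    ("update_points", 1.0),
    ("update_tiles", 4.0),
    ("switch_tiles", 0.5),
]


def serial_finish_times(stages, delays=None):
    """Return {stage: finish_hour} when each stage waits for the previous one.

    `delays` maps a stage name to extra hours lost in that stage.
    """
    delays = delays or {}
    clock = 0.0
    finish = {}
    for name, duration in stages:
        # A stage can only start once the previous stage has finished,
        # so any delay accumulates for everything downstream.
        clock += duration + delays.get(name, 0.0)
        finish[name] = clock
    return finish


baseline = serial_finish_times(STAGES)
delayed = serial_finish_times(STAGES, delays={"analyze_tracks": 2.0})

for name, _ in STAGES:
    print(f"{name}: {baseline[name]}h -> {delayed[name]}h")
```

In this sketch a two-hour problem in the first stage moves the finish time of every later stage by the same two hours, which is why breaking dependencies between the tasks matters more than speeding up any single one.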
- Cartouche. We have recently relaunched cartouche with a new UI, which is a big improvement over the old one. But this was not just a UI project. We have also rewritten the backend on which the old cartouche was running. The new backend allows us to simply add more servers as demand grows, and I'm sure you have noticed that it runs much faster. We are still limited by the number of saves we can process per second, but for now we have a lot of room to grow there.
We launched the new backend sooner than we had planned, as the old backend just couldn't handle the load and we had no way to scale it. Unfortunately we still have some issues we are looking into. While we have solved a lot of them, our cache currently holds some bad data which we need to rebuild. We plan to start working on that in a few days, and hopefully this will resolve a lot of the inconsistencies you encounter.
Until we have rebuilt everything, we will launch an update in a few days which brings back the detailed error messages. These messages will help you figure out what's wrong with your edits and, in many cases, fix the issue and resave.
- Points. We have been working on breaking most of the dependencies which were required for our points calculation. The calculations rely on a huge number of events which our different services produce. As part of scaling out our backend, we are now using Hadoop to process them. That way we can easily add more instances as the amount of data grows. Our first goal is to be able to update the points predictably every day without relying on other processes. Most of the development is done and we plan to deploy it in a week or two. Our next step, which we have already started working on, is to be able to update the points in near real time. More updates on that as soon as we finish the design of our next phase.
- Merging your tracks. We have previously rewritten the process which merges tracks, and it is already able to scale out by just adding more processes. However, only now have we finally developed a supporting system which can actually accommodate and distribute the data for merging. This now enables us to do some parts of the merging process in near real time. Our goal is that within an hour of a drive you will see your track on cartouche along with any new roads and cameras you added. You will also get permission to edit along your drive immediately.
Some parts of the merging process, such as detecting turn restrictions and road directions, are still done offline once a day. But we are working on improving that part too and making it near real time as well.
The new system is in final testing and is planned to be deployed within a month.
- US/World deployment differences. Map problems are still shown only on the US system. We are planning to deploy them on the world servers as well, but I don't have an ETA yet. Our policy is not to have differences, as managing different configurations makes our life harder. However, in this case a gap was created and we are trying to close it as soon as possible.
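For readers curious about why the points calculation described above scales by adding instances, here is a minimal map/reduce-style sketch in plain Python. It does not use the Hadoop API and is not the production code; the event format, event types and point values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical point values per event type; the post does not list real ones.
POINTS_PER_EVENT = {"drive_km": 1, "map_edit": 3, "report": 6}


def map_event(event):
    """Map step: turn one raw (user, event_type, count) event into (user, points)."""
    user, event_type, count = event
    # Unknown event types contribute zero points in this sketch.
    return user, POINTS_PER_EVENT.get(event_type, 0) * count


def shuffle(pairs):
    """Shuffle step: group the mapped point values by user."""
    grouped = defaultdict(list)
    for user, points in pairs:
        grouped[user].append(points)
    return grouped


def reduce_points(grouped):
    """Reduce step: sum each user's point values into a daily total."""
    return {user: sum(values) for user, values in grouped.items()}


# Hypothetical sample events produced by different services.
events = [
    ("alice", "drive_km", 12),
    ("bob", "map_edit", 2),
    ("alice", "report", 1),
    ("bob", "unknown", 5),
]

totals = reduce_points(shuffle(map(map_event, events)))
print(totals)
```

Because each event is mapped independently and each user's total is reduced independently, the work can be split across as many machines as the event volume requires, which is the property a framework like Hadoop provides at scale.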
Ehud