State of the Service, 2006-08-15
Overview
What follows is a status update for the OSAF Hosted Service. Questions and comments are welcomed.
Since about early June 2006, OSAF has applied resources towards the incubation of a hosted service integrated into the OSAF product roadmap. After about 2 months, here's where we are.
The primary function of the hosted service will be to provide a long-lived installation of Cosmo to the public for free, integrated as needed with Internet infrastructure such as email.
The Hosted Service shares mission points with all OSAF teams:
- Get people outside OSAF using the ecosystem
- Deliver something useful and integrated by Beta around March of 2007
- Enable collaboration workflows of the OSAF 1.0 target users
Scheduling
The scheduling guidelines in use include:
- Support Beta: It makes sense to launch something in support of and in sync with OSAF Beta in March 2007. The Hosted Service is working towards that specific calendar target.
- Beta is roughly complete: No major architectural layers should be deployed during the Beta timeframe; foundation components and patterns that are supposed to be in place to grow at 1.0 should be in place for testing by Beta. In particular, load balancing should be in place.
- Incremental deployment: We should be able to have a reasonable start with a small number of machines and add more later.
Right now, Hosted Service releases are being planned as monthly events, numbered M1, M2, M3, etc. So there are about 9 milestone releases for the Hosted Service before Beta in early 2007.
Within that schedule, the major design and planning areas that seem to be consuming us are:
- Email: What server-side support will be needed to support email integration workflows?
- Scaling: What is needed to take a real-world deployment of Cosmo to 100K-1M+ users cost-effectively?
- Customer support: How to keep support costs for the service low?
- Release planning: given available time and resources, what can realistically be implemented?
Email and scaling design are proceeding primarily in mailing lists. Release planning is proceeding primarily on the wiki from information on the lists.
Dependencies and integration points
The capital and labor needed to support the Hosted Service are highly dependent on the number of users using the service and the features in use. Both of these are heavy unknowns. To support a steady addition of user signups without keeling over, we're striving to build an architecture that allows the incremental addition of servers, up to probably a maximum of 40 machines. We'd like to demonstrate conclusively before Beta that at least 2 databases and 3 app servers can be applied to distribute the computational load. (Major new infrastructure is likely to be needed at say 1M+ users.)
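As a rough illustration of incremental app server addition, here is a minimal Python sketch of a round-robin pool. It assumes app servers hold no per-user state, so any server can handle any request. The class name, method names, and server names are hypothetical and do not come from Cosmo or any existing load balancer.

```python
class AppServerPool:
    """Hypothetical round-robin pool; assumes app servers are stateless."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0  # count of requests handed out so far

    def add_server(self, name):
        # A new server joins the rotation without disturbing existing ones.
        self.servers.append(name)

    def pick(self):
        # Rotate through the current server list, one request at a time.
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server


pool = AppServerPool(["app1", "app2"])
first, second = pool.pick(), pool.pick()
pool.add_server("app3")  # capacity added while the service is running
third = pool.pick()
```

In a real deployment this role would more likely be filled by a dedicated load balancer. The sketch only shows why statelessness makes one-at-a-time growth straightforward.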
The two primary functional areas, email and scaling, each have major blocking issues.
- Email: Availability of Morse Code mechanism to upload new item to Cosmo collection
- Scaling: Availability of Hibernate backend to support load-distribution and all planned clustering features
The impact of these blockers is that little Hosted Service prototyping and deployment testing can be performed at the current time. It's likely to be 3+ months before these features are generally available in Cosmo's trunk code. This affects the Hosted Service schedule by pushing back integration and testing to be very close to the Beta, which is a concern to be tracked carefully. Additionally, given that early investigations are not possible, the effort and capital budgets have a high degree of uncertainty.
What can be done about the tight schedule? The best available Hosted Service plan is:
- Try to move forward everything that can be done now: performance testing framework, ticket tracking, possibly other customer self-service
- Advocate for Hibernate backend availability early. At the first possible working Hibernate checkin, the hosted service should be immediately helping test in the background.
- It's possible that significant portions of service functionality will be delivered during Beta. This is a feasible approach, as long as the number of service users does not grow beyond the capacity of a pre-load-distribution architecture.
Practical ways to minimize the impact of the current email+scaling lull are actively being sought. Suggestions welcomed.
The following items seem to be low-hanging fruit. As they are not blocked on other items, they are active work items.
- Admin/VM server: Getting a decent server dedicated to the Hosted Service will allow prototyping of much of the Beta architecture as virtual machines. With this approach, we'll hold off most physical machine purchases throughout much of Alpha, and can later turn individual nodes from virtual to physical machines as real-world growth demands. This machine has been purchased and was recently delivered.
- Customer support: We know customer support costs are important to operations, so two customer support initiatives can be undertaken without being blocked on Cosmo: ticket tracking and customer forums. Forums will be important to foster customer self-service. Ticket tracking is important for efficiently handling those issues which do generate a support request. Ticket tracking is also the foundation for allowing multiple support people to field customer inquiries.
- Service centralization: Merge cosmo-demo/scooby-demo into one instance under a temporary brand
Email and scaling plans
Given what we do know so far, what are we specifically planning to do for email and scaling?
Email plans:
- Email submission of items to collections: Currently assuming that an "email robot" will be deployed uniquely to the service to bridge between the open Internet
- Email account hosting: There has been discussion of having free email accounts provided by the service. Some combination of send/outbound/SMTP, receive/inbound/IMAP/POP, webmail, SSL, spam/virus protection services may need to be built and/or managed.
Scaling plans:
- Increase raw performance substantially via Hibernate backend for Cosmo
- Split user data between physical databases
- Include support in Cosmo to look up which physical database to use for each request
- Keep Cosmo stateless so as to support a clustering architecture
- Place user profile and session persistence in a centralized database
- Utilize distributed/clustered cache objects in Tomcat
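To make the per-request database lookup concrete, here is a minimal Python sketch of a central directory that maps each user to one physical database. It is an illustration only: the `ShardDirectory` class, the least-loaded assignment policy, and the database names are assumptions, not the planned Cosmo design.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ShardDirectory:
    """Hypothetical central directory mapping each user to one physical database."""

    databases: list[str]
    assignments: dict[str, str] = field(default_factory=dict)

    def add_database(self, name: str) -> None:
        # New databases can be added later; existing users stay where they are.
        self.databases.append(name)

    def database_for(self, username: str) -> str:
        # Return the user's database, assigning the least-loaded one on first lookup.
        if username not in self.assignments:
            counts = {db: 0 for db in self.databases}
            for db in self.assignments.values():
                counts[db] += 1
            self.assignments[username] = min(self.databases, key=lambda db: counts[db])
        return self.assignments[username]


directory = ShardDirectory(databases=["db1", "db2"])
alice_db = directory.database_for("alice")
bob_db = directory.database_for("bob")
```

Because the mapping lives in one central place rather than in any app server, the app servers themselves can remain stateless. A stateless app server would consult the directory (or a cached copy of it) at the start of each request.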
Customer support
The general issue of customer support subsumes the issue of how well supported the hosted service is, both now and in the future. How fast will outages be detected and fixed? Are customer requests being responded to quickly?
Customer support for the Hosted Service is a tricky issue. In-house customer support takes people-power and real dollars to pay people to respond to issues.
At the time of this writing, the Hosted Service is very small and composed primarily of users within OSAF. There is an emergency ticket system available, and multiple people will be paged 24x7 if an emergency ticket is submitted. Most outages of the *-demo servers are noticed and generate an alert (though not necessarily a page). This ticket system is maintained by KEI IT and the addresses are available to staff only, not end-users, at this time.
When there are enough servers and customer support requests to support 3-4 full-time staff members in operations, the situation will be relatively clear: the system will be closely monitored 24x7, there will also be a 24x7 escalation channel, and there will be full-time people whose job it is to answer customer support emails.
Between now and 4 full-time staff, the support question is less clear. There are three tactics in use to address this "bridge" period.
- Leverage KEI IT services and place as many servers as possible under primary responsibility of KEI IT support
- GNi, our colocation provider, will be engaged at some point to provide systems-level monitoring and response 24x7. This will be a little rocky but may help improve response times and coverage without requiring in-house full-time hires.
- We will be growth-driven. The plan is to get by with what support resources we have for now, and when the support load grows sufficiently, hire, outsource, or take other action to bring the support hassle under control.
Beta rough guess
Here's a current-best shot at what's in the Hosted Service Beta:
- Cosmo: Supported, production instance of Cosmo provided to the public for free
- Load balancing: User data can be partitioned between physical databases and app servers can be added one-at-a-time to the app server pool
- Management dashboard: Web page with some core product metrics and graphs
- Email submission of items to collection
- Backups: Nightly copies of all data in system
- In-house NOC: Network Operations Center run almost-24x7 in-house between OSAF and KEI IT
- SSL: SSL is not normally considered a feature, but it has enough operational impact that we want to plan towards it specifically
- Final branding: Right before Alpha, we'll be able to merge with the overall OSAF branding and move onto hopefully-final domain names before Alpha launch
- Mostly virtual hosting: The Alpha prototype will be built on top of virtual servers; by Alpha some servers will be set up as separate physical machines
- Load-tested cluster: The Alpha configuration will have been performance tested sufficiently to have a reasonable estimate of total capacity at Alpha launch
- Ticket tracker: A best-practice tool to help smooth time-consuming customer support emails
After Beta, these things would be added:
- Failover hardware: More layers of hardware can tolerate failure. It's not yet clear we'll have a sufficient quantity of users to justify the expense of full hardware redundancy by Beta.
- 24x7 coverage: Service level standards of 15-minute response time to emergency situations 24x7.
- Firewall: Dedicated firewall layer for all hosted service hosts
- Customer self-service: Mechanisms such as public forums which enhance customers' ability to solve their own problems (keeping support costs low)
Beyond the horizon
There are a number of things that are hard to see with any precision right now:
- Revenue generation
- Further capital purchases
- Specific capital or labor required for Beta or 1.0
- Number of Hosted Service customers at different stages
- Post-Beta features such as mobile-specific support, SMS support, etc