Operations
Data Center Power Efficiency
by Jesse Robbins
James Hamilton is one of the smartest and most accomplished engineers I know. He now leads Microsoft's Data Center Futures Team, and has been pushing the opportunities in data center efficiency and internet scale services both inside & outside Microsoft. His most recent post explores misconceptions about the Cost of Power in Large-Scale Data Centers:
I’m not sure how many times I’ve read or been told that power is the number one cost in a modern mega-data center, but it has been a frequent refrain. And, like many stories that get told and retold, there is an element of truth to it. Power is absolutely the fastest-growing operational cost of a high-scale service. Except for server hardware costs, power and costs functionally related to power usually do dominate.
However, it turns out that power alone isn’t anywhere close to the most significant cost. Let’s look at this more deeply. If you amortize power distribution and cooling infrastructure over 15 years and server costs over 3 years, you get a fair comparative picture of how server costs compare to infrastructure (power distribution and cooling). But how do you compare the capital costs of servers and of power and cooling infrastructure with that monthly bill for power?
The approach I took is to convert everything into a monthly charge. [...]
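To make that conversion concrete, here's a minimal sketch of turning capital and operating costs into comparable monthly charges. The amortization periods match the quote above, but every dollar figure, the 5% cost of money, and the power draw are placeholder assumptions for illustration, not Hamilton's numbers:

```python
# A minimal sketch (not Hamilton's actual model) of converting capital and
# operating costs into comparable monthly charges. All numbers are placeholders.

def monthly_amortized(capital_cost, years, annual_rate=0.05):
    """Amortize a capital cost into a monthly payment (standard annuity formula)."""
    n = years * 12            # number of monthly payments
    r = annual_rate / 12      # monthly cost of money
    return capital_cost * r / (1 - (1 + r) ** -n)

# Placeholder inputs -- illustrative only.
servers = monthly_amortized(45_000_000, years=3)     # servers amortized over 3 years
facility = monthly_amortized(80_000_000, years=15)   # power + cooling infrastructure over 15 years
power_kw = 8_000                                      # assumed average draw in kW
power_bill = power_kw * 24 * 30 * 0.07                # ~30-day month at $0.07/kWh

for name, cost in [("servers", servers), ("power+cooling infra", facility), ("power bill", power_bill)]:
    print(f"{name:>20}: ${cost:,.0f}/month")
```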
tags: cloud computing, energy, James Hamilton, microsoft, operations, performance, platforms, utilities, utility computing, velocity, velocity09, web2.0
My Web Doesn't Like Your Enterprise, at Least While it's More Fun
by Jim Stogdill
The other day Jesse posted a call for participation for the next Velocity Web Operations Conference. My background is in the enterprise space, so, despite Velocity's web focus, I wondered if there might not be interest in a bit of enterprise participation. After all, enterprise data centers deal with the same "Fast, Scalable, Efficient, and Available" imperatives. I figured there might be some room for the two communities to learn from each other. So, I posted to the internal Radar authors' list to see what everyone else thought.
Mostly silence. Until Artur replied with this quote from one of his friends employed at a large enterprise: "What took us a weekend to do, has taken 18 months here." That concise statement seems to sum up the view of the enterprise, and I'm not surprised. For nearly six years I've been swimming in the spirit-sapping molasses that is the Department of Defense IT Enterprise so I'm quite familiar with the sentiment. I often express it myself.
We've had some of this conversation before at Radar. In his post on Enterprise Rules, Nat used contrasting frames of reference to describe the web as your loving, dear old API-provisioning dad, while the enterprise is the belt-wielding, standing-in-the-front-door-when-you-come-home-after-curfew stepfather.
While I agree that the enterprise is about control and the web is about emergence (I've made the same argument here at Radar), I don't think this negative characterization of the enterprise is all that useful. It seems to imply that the enterprise's orientation toward control springs fully formed from the minds of an army of petty controlling middle managers. I don't think that's the case.
I suspect it's more likely the result of large scale system dynamics, where the culture of control follows from other constraints. If multiverse advocates are right and there are infinite parallel universes, I bet most of them have IT enterprises just like ours; at least in those shards that have similar corporate IT boundary conditions. Once you have GAAP, Sarbox, domain-specific regulation like HIPAA, quarterly expectations from "The Street," decades of MIS legacy, and the talent acquisition realities that mature companies in mature industries face, the strange attractors in the system will pull most of those shards to roughly the same place. In other words, the IT enterprise is about control because large businesses in mature industries are about control. On the other hand, the web is about emergence because in this time, place, and with this technology discontinuity, emergence is the low energy state.
Also, as Artur acknowledged in a follow-up email to the list, no matter what business you're in, it's always more fun to be delivering the product than to be tucked away in a cost center. On the web, bits are the product. In the enterprise, bits are squirreled away in a supporting cost center that always needs to be ten percent smaller next year.
tags: operations, web2.0
Velocity 2009: Themes, ideas, and call for participation...
by Jesse Robbins
Last year's Velocity conference was an incredible success. We expected around 400 people and we ended up maxing out the facility with over 600. This year we're moving the conference to a bigger space and extending it to 3 days to accommodate workshops and longer sessions.
Velocity 2009 will be held June 22-24, 2009, at the Fairmont Hotel in San Jose, CA.
This year's conference will be especially important. I've said many times that Web Performance and Operations is critical to the success of every company that depends on the web. In the current economic situation, it's becoming a matter of survival. The competitive advantage comes from the ability to do two things:
Our Velocity 2009 mantra is "Fast, Scalable, Efficient, Available", a slight change from last year. (We've replaced "Resilient" with "Efficient" to make the focus clear.)
I'm excited to announce that joining Steve Souders & me on this year's program committee are John Allspaw, Artur Bergman, Scott Ruthfield, Eric Schurman, and Mandi Walls. We've already started working on the program and have just opened the Call for Participation.
tags: Artur Bergman, conferences, Eric Schurman, John Allspaw, Mandi Walls, operations, performance, Scott Ruthfield, Steve Souders, velocity, velocity09, web2.0, webops
DisasterTech: "Decisions for Heroes"
by Jesse Robbins
One of the most interesting DisasterTech projects I've been following is "Decisions for Heroes" led by developer and Irish Coast Guard volunteer Robin Blandford.
Decisions is like Basecamp for volunteer Search & Rescue teams. The focus is on providing "just enough" process to complement the real-world workflow of a rescue team, without unnecessary complexity. One of Robin's design goals:
User requirements are nil. Nobody likes reading manuals - if we have to write one, we've gotten too complicated.
This is the winning approach for building systems that "serve those that serve others", and is echoed by InSTEDD's design philosophy and the Sahana disaster management system.
Teams begin by entering their responses to incidents and training exercises. They then tag them with things like the weather conditions, the tools and skills required, and who from the team was deployed.
As a team's incident database grows, this information can be used to show heatmaps and provide powerful insight into the locations, weather conditions, and times of year at which various incidents occur. Over time this kind of data could be analyzed in aggregate across multiple teams and regions to create an incredibly powerful resource for Emergency Managers. This is very similar to what Wesabe does for consumers with financial transaction data today (disclosure: OATV investment).

Rescue team members enter training dates and levels. The system tracks certification expiration dates and prompts team members & leaders to plan classes and remain current. This is a huge issue for volunteers, who have to balance professional-level training requirements against the demands of a regular career.
As more incidents are entered into the system, it compares the skills required for each rescue with the team's training exercises. This allows teams to identify where to focus, train, and develop new skills.
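To make the idea concrete, here's a minimal sketch of the two analyses described above: counting incidents by conditions for a heatmap-style summary, and comparing the skills that incidents demanded against the skills the team has trained. The record layout and field names are hypothetical, not Decisions' actual data model:

```python
# A minimal sketch of the analyses described above. The incident records and
# field names are hypothetical, not Decisions for Heroes' actual data model.
from collections import Counter

incidents = [
    {"location": "cliffs", "weather": "gale", "skills": {"rope rescue", "first aid"}},
    {"location": "harbour", "weather": "fog", "skills": {"boat handling"}},
    {"location": "cliffs", "weather": "gale", "skills": {"rope rescue", "night ops"}},
]
trained_skills = {"rope rescue", "first aid", "boat handling"}

# Heatmap-style summary: how often each (location, weather) combination occurs.
by_conditions = Counter((i["location"], i["weather"]) for i in incidents)
print(by_conditions.most_common())        # [(('cliffs', 'gale'), 2), ...]

# Skill gap: skills that incidents demanded but the team has not trained for.
required = set().union(*(i["skills"] for i in incidents))
print("train next:", required - trained_skills)   # {'night ops'}
```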

tags: disaster tech, disastertech, emergency management, firefighting, humanitarian aid, ict, innovation, operations, rescue, social networking, web 2.0, webops
Sprint blocking Cogent network traffic...
by Jesse Robbins
It appears that Sprint has stopped routing traffic from Cogent (called "depeering") as a result of some sort of legal dispute. Sprint customers cannot reach Cogent customers, and vice versa. The effect is similar to what would happen if Sprint were to block voice phone calls to AT&T customers.
Here's a graph that shows the outage, courtesy of Keynote:
Rich Miller at DataCenterKnowledge has a great summary of the issues behind the incident, which has happened with Cogent before. Rich says:
At the heart of it, peering disputes are really loud business negotiations, and angry customers can be used as leverage by either side. This one will end as they always do, with one side agreeing to pay up or manage their traffic differently.
I think this is particularly Radar-worthy because it provides an example of the complex issues around Net Neutrality. In this case customers are harmed and most (especially Sprint wireless customers) will have no immediate recourse.
tags: cloud computing, cogent, disruption, innovation, internet policy, network neutrality, operations, sprint, utilities, utility computing, webops
Amazon's new EC2 SLA
by Jesse Robbins
Amazon announced a new SLA for EC2, similar to the one for S3. This is a notable step for Amazon and cloud computing as a whole, as it establishes a new bar for utility computing services.
Amazon is committing to 99.95% availability for the EC2 service on a yearly basis, which corresponds to approximately four hours and twenty-three minutes of downtime per year. It's important to remember that an SLA is just a contract that provides a commitment to a certain level of performance and some form of compensation when a provider fails to meet it.
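As a quick sanity check on that figure, here's the back-of-the-envelope arithmetic (assuming a 365-day Service Year):

```python
# Back-of-the-envelope check: allowed downtime under a 99.95% annual SLA.
hours_per_year = 365 * 24                  # 8,760 hours
allowed_downtime = (1 - 0.9995) * hours_per_year
print(allowed_downtime)                    # ~4.38 hours
print(f"{int(allowed_downtime)}h {round(allowed_downtime % 1 * 60)}m")  # 4h 23m
```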
Here's the summary of the EC2 SLA (emphasis added):

Service Commitment: AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit as described below. [...]

To receive a Service Credit, you must submit a request by sending an e-mail message to aws-sla-request @ amazon.com. To be eligible, the credit request must [...] include your server request logs that document the errors and corroborate your claimed outage (any confidential or sensitive information in these logs should be removed or replaced with asterisks).
- “Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.” If you have been using Amazon EC2 for less than 365 days, your Service Year is still the preceding 365 days but any days prior to your use of the service will be deemed to have had 100% Region Availability [...]
- “Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances. [...]
This new SLA does not appear to address the reliability of server instances individually or in aggregate. For example, if half of a customer's EC2 instances lose their connections or die every 6 minutes, EC2 would still be considered "available" even if it is essentially unusable.
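To illustrate how the definitions above interact, here's a minimal sketch of the Annual Uptime Percentage calculation. The outage counts are invented, and the availability test is a simplification of the contract language:

```python
# Sketch of the Annual Uptime Percentage as defined above: the fraction of
# 5-minute periods in the Service Year that were NOT "Region Unavailable".
# Per the definition, a period counts as unavailable only if *all* running
# instances lost external connectivity AND replacements could not be launched,
# so partial failures (e.g. half the fleet dying) never add to this count.

PERIODS_PER_YEAR = 365 * 24 * 12           # 5-minute periods in 365 days

def annual_uptime_percentage(unavailable_periods: int) -> float:
    return 100.0 - 100.0 * unavailable_periods / PERIODS_PER_YEAR

# Invented example: 50 fully-unavailable 5-minute periods (~4h10m of outage).
print(annual_uptime_percentage(50))        # 99.952... -> still meets 99.95%
print(annual_uptime_percentage(53))        # 99.949... -> breaches the SLA
```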
If the entire EC2 service is down for more than a cumulative four hours and twenty minutes, customers must furnish proof of the outage to Amazon to be eligible for the 10% credit. This seems like an onerous process for very little compensation, and it isn't in line with Amazon's famous "Relentless Customer Obsession". Amazon takes monitoring very seriously and should take the lead by tracking, reporting, and proactively compensating customers when it lets them down.
tags: amazon, availability, cloud computing, ec2, operations, s3, sla, webops
Kaminsky DNS Patch Visualization
by Jesse Robbins
Dan Kaminsky has posted the details of the widespread DNS vulnerability. Clarified Networks created this visualization of DNS patch deployment over the past month:
Red = Unpatched
Yellow = Patched, "but NAT is screwing things up"
Green = OK
tags: internet policy, operations, platform plays, velocity, worries
The new internet traffic spikes
by Jesse Robbins
Theo Schlossnagle, author of Scalable Internet Architectures, gave a great explanation of how internet traffic spikes are shifting:
Lately, I see more sudden eyeballs, and what used to be an established trend seems to fall into a more chaotic pattern that is the aggregate of different spike signatures around a smooth curve. This graph is from two consecutive days where we have a beautiful comparison of a relatively uneventful day followed by a long-exposure spike (nytimes.com) compounded by a short-exposure spike (digg.com):

The disturbing part is that this occurs even on larger sites now due to the sheer magnitude of eyeballs looking at today's already popular sites. Long story short, this makes planning a real bitch.
[...] What isn't entirely obvious in the above graphs? These spikes happen inside 60 seconds. The idea of provisioning more servers (virtual or not) is unrealistic. Even in a cloud computing system, getting new system images up and integrated in 60 seconds is pushing the envelope, and that would assume a zero-second response time. This means it is about time to adjust what our systems architecture should support. The old rule of 70% utilization accommodating an unexpected 40% increase in traffic is unraveling. At least eight times in the past month, we've experienced sudden 100% to 1000% increases in traffic across many of our clients.
[Link]
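The headroom arithmetic behind that old rule, and why the new spike sizes break it, is easy to sketch (illustrative numbers only):

```python
# Why the "70% utilization absorbs a 40% spike" rule breaks down.
# Illustrative numbers only.
baseline_utilization = 0.70

for spike in (0.40, 1.00, 10.00):          # +40%, +100%, +1000% traffic
    needed = baseline_utilization * (1 + spike)
    status = "fits" if needed <= 1.0 else f"needs {needed:.1f}x current capacity"
    print(f"+{spike:.0%} spike -> {needed:.0%} of capacity ({status})")
```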
tags: operations, trends, velocity, web 2.0, worries
Video of Rich Wolski's EUCALYPTUS talk at Velocity
by Jesse Robbins
Rich Wolski gave a truly impressive talk at Velocity about EUCALYPTUS, an open-source software infrastructure for cloud computing. The API is compatible with Amazon's EC2 interface, and the underlying infrastructure is designed to support multiple client-side interfaces. EUCALYPTUS is implemented using commonly available Linux tools and basic Web-service technologies, making it easy to install and maintain. Watch and learn...
You can see more videos from Velocity on Blip.tv.
tags: cloud computing, ec2, movers and shakers, open source, operations, platform plays, science, utility computing, velocity, velocity08, videos, web 2.0
Hyperic CloudStatus service dashboard launches at Velocity!
by Jesse Robbins
Javier Soltero just launched CloudStatus during his Hyperic sponsor session today at Velocity. CloudStatus is a public health dashboard for web services like Amazon's EC2/S3, and Google's App Engine.
Javier called to tell me about this last week after I declared that "Service Monitoring Dashboards are mandatory". This comes right after Amazon and Google had visible outages, and couldn't have happened at a better time. I'm really excited to see this idea take off, as it's something that is critical to the broad adoption of web services and cloud computing.
tags: cloudstatus, hyperic, monitoring, operations, outages, platform plays, specialized services, startups, velocity, velocity08, web 2.0, webops
Service Monitoring Dashboards are mandatory for production services!
by Jesse Robbins
Google App Engine went down earlier today. GAE is still a developer preview release and currently lacks a public monitoring dashboard. Unfortunately, this means that many people found out either when their app and admin consoles became unavailable or from Mike Arrington's post on TechCrunch.
Google has a strong Web Operations culture, and there are numerous internal monitoring tools in use across the company, along with a smaller set available to customers. It's surprising that Google launched a developer platform without providing something beyond an email group, although they are by no means the first to do so.
Service Monitoring Dashboards are mandatory for production services and platforms!
- If you launch a platform that people pay you money for, you need to have a real-time service dashboard. Ideally this should be decoupled from the rest of your infrastructure (see the sketch after this list).
- Don't rely on platforms that lack service monitoring dashboards for production.
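As a concrete (if deliberately tiny) illustration of what "decoupled" means in the first point, the sketch below serves status from a standalone process that only reads a JSON file published by your monitoring pipeline, so the dashboard can stay up even when the platform it describes does not. The file name and fields are hypothetical:

```python
# A deliberately tiny, standalone status endpoint -- a sketch, not anyone's
# production dashboard. It only reads a status.json file that an external
# monitoring pipeline publishes, so it shares no infrastructure with the
# service it describes and should be hosted separately from it.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STATUS_FILE = "status.json"   # hypothetical: written by an external monitor

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            with open(STATUS_FILE) as f:
                body = f.read().encode()
        except OSError:
            # If even the status file is missing, say so rather than lie.
            body = json.dumps({"status": "unknown"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), StatusHandler).serve_forever()
```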
Many companies are initially reluctant to provide this kind of monitoring to the public, and only do so in reaction to an outage. However, it seems that every company that offers such a dashboard uses it as a source of competitive advantage.
The best example of this is trust.salesforce.com, which Salesforce launched after a series of outages in 2006. Amazon (eventually) launched a status dashboard for AWS, and added RSS feeds for specific services, which I think is pretty cool.
Javier Soltero at Hyperic points out:
1. The reports of service outages arrive long after anyone who depends on the services can possibly do anything to mitigate their effect.
2. The services themselves seem incapable of providing any visibility into the circumstances that might lead to future outages. [...]

Even TechCrunch points out that the Google Apps blog doesn’t even mention the outage. Other clouds rely on blogs such as this one, this one, or maybe even this one (from our good friends at Mosso). These are all places where outages can be discussed, but they are not the right means for people to find out whether it was their application that crashed or the cloud that it depends on.
(Updated: Niall Kennedy pointed out that GAE is still a preview release, and I agree that my original wording was wrong. My intent is to emphasize the importance of providing a public service dashboard and so I've edited accordingly.)
tags: failure happens, google app engine, infrastructure, internet policy, monitoring, operations, outages, platform plays, platforms, saas, velocity, web 2.0, web services, webops
Two new open source projects at Velocity
by Jesse Robbins
At Velocity next week, two significant open source projects will be debuting. The first is Jiffy: Open Source Performance Measurement and Instrumentation, a tool created by Scott Ruthfield and his team at Whitepages.com.
Most tools for measuring web performance come in two flavors:
- Developer-installed tools (Firebug, Fiddler, etc.) that allow individuals to closely trace single sessions
- Third-party performance monitoring systems (Gomez, Keynote, etc.) that will hit your site occasionally and report back component-level metrics (for a fee)
Neither of these tools gives you real-world information on what’s actually happening with your clients: how long pages really take to load, what the real cost of client-side execution is, and what the impact of your loading or dependency chain is. This is even more important when you don’t host all of your own assets (when you load ads or JavaScript from third parties, for example) and need to monitor their performance.
Thus we built Jiffy—an end-to-end system for instrumenting your web pages, capturing client-side timings for any event that you determine, and storing and reporting on those timings. You run Jiffy yourself, so you aren’t dependent on the performance characteristics, inflexibility, or costs of third-party hosted services.
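For a sense of what the "reporting" half of this kind of instrumentation involves, here's a minimal sketch that summarizes collected client-side timings with percentiles rather than averages (averages hide the slow tail). It is not Jiffy's actual code, and the beacon fields are invented:

```python
# A sketch of reporting on collected client-side timings -- not Jiffy's actual
# code. Assume each beacon recorded (page, event, elapsed milliseconds).
import math
from collections import defaultdict

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already-sorted, non-empty list."""
    k = math.ceil(p / 100 * len(sorted_values))   # 1-indexed rank
    return sorted_values[max(k, 1) - 1]

# Invented beacon data: (page, event, elapsed milliseconds).
beacons = [("/home", "onload", ms) for ms in (180, 220, 240, 260, 310, 2900)] + \
          [("/search", "onload", ms) for ms in (400, 450, 480, 5200)]

timings = defaultdict(list)
for page, event, ms in beacons:
    timings[(page, event)].append(ms)

for key, values in timings.items():
    values.sort()
    print(key, "median:", percentile(values, 50), "p95:", percentile(values, 95))
```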
The second project is EUCALYPTUS, the Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems, presented by Rich Wolski from UCSB. This project has already started getting attention. (Many thanks to Surj Patel of Structure08/GigaOM for connecting us!)
Eucalyptus is an open-source software infrastructure for implementing "cloud computing" on clusters. The current interface to EUCALYPTUS is compatible with Amazon's EC2 interface, but the infrastructure is designed to support multiple client-side interfaces. EUCALYPTUS is implemented using commonly available Linux tools and basic Web-service technologies, making it easy to install and maintain.
The talk will focus on the design, the implementation tradeoffs we have identified in implementing Eucalyptus as an exploratory tool, and the ways in which we have chosen to address these tradeoffs in the first version of the software.
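Because the interface is EC2-compatible, existing EC2 tooling can in principle be pointed at a EUCALYPTUS endpoint instead of at Amazon. The sketch below shows the idea using boto3, a modern client rather than period-accurate tooling; the endpoint URL and credentials are hypothetical:

```python
# Sketch of the practical upshot of an EC2-compatible API: point an ordinary
# EC2 client at a different endpoint. The endpoint URL and credentials are
# hypothetical, and boto3 is a modern client, not period-accurate tooling.
import boto3

ec2 = boto3.client(
    "ec2",
    endpoint_url="https://eucalyptus.example.edu:8773/services/compute",  # hypothetical
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
    region_name="eucalyptus",
)

# The same calls you would make against Amazon EC2.
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])
```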
tags: cloud, cloud computing, ec2, gomez, jiffy, keynote, metrics, open source, operations, performance, platform plays, startups, structure08, velocity, velocity08, web 2.0, web monitoring, webops