Web Perf/Ops Posts

Caching Strategies for Improved Web Performance
OSCON 2013 Speaker Series
Caching is the method that most improves response time in web applications (as Steve Souders shows in Cache is King), but in order to make use of it, every layer of your application must be configured for that purpose.
Most applications are initially developed with little or no use of caching and then must be refactored to meet performance goals. This approach incurs extra development costs that could be avoided if response time were taken into consideration in the early stages of the development process.
The methodology that can save your life while you are still developing your application is pretty straightforward: keep caching in mind whenever handling data in your system. Both web APIs and internal backend data flows need to answer one simple question:
Can I survive if the data seen by the user is not the latest?
Sometimes the answer to this question is ‘no.’ For example, I would be fired very quickly if I built a bank system that showed more money than a customer’s account really holds. On the other hand, if the system interacts with general data services like social networks, news, weather, car traffic, etc., there is less need to ensure the latest piece of information is immediately shown to the user.
Of course, the latest data eventually needs to reach the user. Data cannot be too old or you risk confusing the user, but configuring a short expiration time (let’s say 5-10 minutes or less) for dynamic data that can tolerate it can significantly improve response times. This is called temporal consistency, and it is crucial to a successful caching strategy.
Nowadays, web applications are built by mashing up several web services coming from different sources. The best way to accommodate their different response times and data designs is to temporally cache those elements across all system layers. The same applies to data coming from your own system when the information has to travel from one part of the world to another in several hops. If the information is not critical, consider caching it at any intermediate stage and reusing it when needed. Caching in the backend can avoid half of the trip. Even better is to cache at the target device or in a CDN, which can eliminate the full data trip or reduce it to the last mile, an easy way to enhance performance.
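As a small illustration of the short-expiration idea, here is a minimal sketch of marking a dynamic but tolerant endpoint as cacheable for five minutes so browsers, proxies, and CDNs can reuse the response. The use of Flask and the /weather endpoint are assumptions made for the example, not something from the post.

```python
# Minimal sketch: serving "temporally consistent" data with a short cache lifetime.
# Flask and the /weather endpoint are illustrative assumptions, not from the post.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/weather")
def weather():
    # Non-critical data: users can tolerate values that are a few minutes old.
    payload = {"city": "Santa Clara", "temperature_c": 21}
    response = jsonify(payload)
    # Allow browsers, CDNs, and proxies to reuse this response for 5 minutes.
    response.headers["Cache-Control"] = "public, max-age=300"
    return response

if __name__ == "__main__":
    app.run()
```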

Velocity CA Recap
Failure is a Feature
The Santa Clara edition of our Velocity conference wrapped up a little over a week ago, and I’ve had a chance to reflect on the formal talks and excellent hallway conversations I had throughout. Here are a few themes I saw, including a few of the standout talks:
1. Velocity continues to grow. I had to qualify that I’d been to the Santa Clara conference, because it’s now cropped up in three more locations annually, starting with China and Europe last year, and moving to the newest location this year: New York in October. I’m excited to see what new perspectives this will bring, most notably on the financial industry side of things.
2. The web is getting faster (barely). Steve Souders mentioned this in his keynote at HTML5 Developer Conference (with a related writeup here), and Tammy Everts’s excellent summary of her experience at Velocity provided a slightly depressing list of things that people still aren’t really doing, or are struggling with: third-party content, images, caching, web fonts and… JavaScript. Along with Tammy, I also noticed some varying opinions about how helpful Responsive Web Design really is when it comes to mobile performance. As mobile usage continues to grow dramatically (and the greatest growth is in the developing world, on slow cellular networks and basic devices), these pain points are only multiplied on processors optimized for battery consumption rather than CPU performance. The highlight on this front for me was Ilya Grigorik’s talk on Optimizing the Critical Rendering Path for Instant Mobile Websites (note: I’m wholly biased here, as I’m editing his soon-to-be-released book, High Performance Browser Networking).
3. Perception matters (and page load time doesn’t measure it). Quite a few talks hit on the idea of getting the most critical information in front of people first, and letting the rest load after. (Steve Souders gave a really great Ignite talk on this as well.) And with single-page apps, the very concept of page load goes out the window (pun intended) almost entirely. My favorite talk on this front was Rachel Myers and Emily Nakashima’s case study of work they’d done (previously) at ModCloth. The bottom line: feature load time was a far more useful performance metric for them–and their management team–when it came to the single-page application they’d built. They’d cobbled their own solution together using Google Analytics and Circonus to track feature load time, but it looks like the new product announced at Velocity from New Relic might just provide that out of the box now. Their presentation also had ostriches and yaks for a little extra awesome.
4. Failure is a feature (and you should plan for it at all levels of your organization and products). The opening keynote from Johan Bergstrom provided a fascinating perspective on risk in complex systems (e.g. web operations). While he didn’t provide any concrete ways to assess your own risk (and that was part of the point), what I took away from it was this: If you’re assessing your risk as a function of the severity and probability of technical components of your system going down (e.g. are they “reliable”), you’re missing a key piece of the picture. Organizations need to factor in humans as some of those components (or “actors”), and look at how a complex system functions via the interdependencies and relationships between its actors. It is constantly, dynamically changing, and risk is a product of all the interactions within the system. (For more reading on this, I highly suggest some of Johan’s references in his blog post about the talk, notably Sidney Dekker’s work.)
Dylan Richard also gave a fantastic keynote about the gameday scenarios he ran during the Obama campaign. The bottom line: Plan for failure. Design your apps and your team to be able to handle it when it happens.
5. A revolution is coming (and there be dinosaurs). Whither Circuit City and Blockbuster? They didn’t just get eaten by Best Buy and Netflix randomly–they failed to see the writing on IT’s wall. With transformative technologies like the cloud and infrastructure automation, the backend is not so back room any longer. And performance isn’t just about the speed of your site or app. Adam Jacobs gave a talk at the very end of the conference (which Jesse Robbins reprised the next day at DevOpsDays) that was a rallying cry for people in IT and Operations: you control the destiny of your organization. It oversimplified many things, in my opinion, but the core message was there, and something we’ve been saying at O’Reilly for a little while now, too: Every business is now an Internet business. The dinosaurs will be those who, in Adam’s words, fail to “leverage digital commerce to rapidly deliver goods and services to consumers.” In other words, transform or die.
You can see all the keynotes, plus interviews and other related Velocity video goodness on our YouTube channel. You can also purchase the complete video compilation that includes all the tutorials and sessions, as well.

The Power of a Private HTTP Archive Instance: Finding a Representative Performance Baseline
Velocity 2013 Speaker Series
Be honest: have you ever wanted to play Steve Souders for a day and pull some revealing stats or trends about web sites of your choice? Or maybe dig around in the HTTP Archive? You can do that and more by setting up your own HTTP Archive instance.
httparchive.org is a fantastic tool to track, monitor, and review how the web is built. You can dig into trends around page size, page load time, content delivery network (CDN) usage, distribution of different mimetypes, and many other stats. With the integration of WebPagetest, it’s a great tool for synthetic testing as well.
You can download an HTTP Archive MySQL dump (warning: it’s quite large) and the source code from the download page and dissect a snapshot of the data yourself. Once you’ve set up the database, you can easily query anything you want.
Setup
You need MySQL, PHP, and your own webserver running. As I mentioned above, HTTP Archive relies on WebPagetest—if you choose to run your own private instance of WebPagetest, you won’t have to request an API key. I decided to ask Patrick Meenan for an API key with limited query access. That was sufficient for me at the time. If I ever wanted to use more than 200 page loads per day, I would probably want to set up a private instance of WebPagetest.
To find more details on how to set up an HTTP Archive instance yourself and any further advice, please check out my blog post.
Benefits
Going back to the scenario I described above: the real motivation is that often you don’t want to throw your website(s) into a pile of other websites (e.g. ones not related to your business) to compare them or define trends. Our digital properties at the Canadian Broadcasting Corporation (CBC) span dozens of URLs that serve different purposes and audiences. For example, CBC Radio covers most of the Canadian radio landscape, CBC News offers the latest breaking news, CBC Hockey Night in Canada offers great insights on anything related to hockey, and CBC Video is the home for any video available on CBC. It’s valuable for us to not only compare cbc.ca to the top 100K Alexa sites but also to verify stats and data against our own pool of web sites.
In this case, we want to use a set of predefined URLs that we can collect HTTP Archive stats for. Hence a private instance can come in handy—we can run tests every day, or every week, or just every month to gather information about the performance of the sites we’ve selected. From there, it’s easy to not only compare trends from httparchive.org to our own instance as a performance baseline, but also have a great amount of data in our local database to run queries against and to do proper performance monitoring and investigation.
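For instance, once the crawl data is in the local MySQL database, an ad-hoc query is all it takes to pull a stat like the mime type distribution. The sketch below is illustrative only: it uses Python with pymysql, and the table and column names (requests, mimeType) as well as the connection credentials are assumptions based on the stock HTTP Archive schema, so adjust them to your own instance.

```python
# Sketch of an ad-hoc query against a private HTTP Archive database.
# Table/column names (requests, mimeType) and credentials are assumptions
# based on the stock HTTP Archive schema; adjust to your own instance.
import pymysql

connection = pymysql.connect(host="localhost", user="httparchive",
                             password="secret", database="httparchive")
try:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT mimeType, COUNT(*) AS num_requests
            FROM requests
            GROUP BY mimeType
            ORDER BY num_requests DESC
        """)
        for mime_type, count in cursor.fetchall():
            print(f"{mime_type}: {count}")
finally:
    connection.close()
```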
Visualizing Data
The beautiful thing about having your own instance is that you can be your own master of data visualization: you can now create more charts in addition to the ones that came out of the box with the default HTTP Archive setup. And if you don’t like Google chart tools, you may even want to check out D3.js or Highcharts instead.
The image below shows all mime types used by CBC web properties that are captured in our HTTP archive database, using D3.js bubble charts for visualization.

Mime types distribution for CBC web properties using D3.js bubble visualization. The data were taken from the requests table of our private HTTP Archive database.

Test-driven Infrastructure with Chef
Velocity 2013 Speaker Series
If you’re a System Administrator, you’re likely all too familiar with the 2:35am PagerDuty alert. “When you roll out testing on your infrastructure,” says Seth Vargo, “the number of alerts drastically decreases because you can build tests right into your Chef cookbooks.” We sat down to discuss his upcoming talk at Velocity, which promises to deliver many more restful nights for SysAdmins.
Key highlights from our discussion include:
- There are not currently any standards regarding testing with Chef. [Discussed at 1:09]
- A recommended workflow that starts with unit testing [Discussed at 2:11]
- Moving cookbooks through a “pipeline” of testing with Test Kitchen [Discussed at 3:11]
- In the event that something bad does make it into production, you can roll back actual infrastructure changes. [Discussed at 4:54]
- Automating testing and cookbook uploads with Jenkins [Discussed at 5:40]
You can watch the full interview here:

Application Resilience in a Service-oriented Architecture
Velocity 2013 Speaker Series
Failure Isolation and Operations with Hystrix
Web-scale applications such as Netflix serve millions of customers using thousands of servers across multiple data centers. Unmitigated system failures can impact the user experience, a product’s image, and a company’s brand and, potentially, revenue. Service-oriented architectures such as these are too complex to completely understand or control and must be treated accordingly. The relationships between nodes are constantly changing as actors within the system independently evolve. Failure in the form of errors and latency will emerge from these relationships and resilient systems can easily “drift” into states of vulnerability. Infrastructure alone cannot be relied upon to achieve resilience. Application instances, as components of a complex system, must isolate failure and constantly audit for change.
At Netflix, we have spent a lot of time and energy engineering resilience into our systems. Among the tools we have built is Hystrix, which specifically focuses on failure isolation and graceful degradation. It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such “minor mistakes” that led to major user impact.
This open source library follows these principles in protecting our systems when novel failures inevitably occur:
- Isolate client network interaction using the bulkhead and circuit breaker patterns.
- Fallback and degrade gracefully when possible.
- Fail fast when fallbacks aren’t available and rapidly recover.
- Monitor, alert and push configuration changes with low latency (seconds).
Restricting concurrent access to a given backend service has proven to be an effective form of bulkheading, as it limits the resource utilization to a concurrent request limit smaller than the total resources available in an application instance. We do this using two techniques: thread pools and semaphores. Both provide the essential quality of restricting concurrent access while threads provide the added benefit of timeouts so the caller can “walk away” if the underlying work is latent.
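Hystrix itself is a Java library, but the bulkhead-with-timeout idea described above can be sketched in a few lines of any language. The Python sketch below is purely illustrative (it is not Hystrix code, and the limits are arbitrary): a semaphore caps concurrent calls to a backend, a thread pool runs the work, and the caller times out and falls back to a degraded default.

```python
# Illustrative sketch of the bulkhead idea (not Hystrix itself): cap concurrent
# calls to a backend and let callers time out and fall back to a default value.
import concurrent.futures
import threading

MAX_CONCURRENT_CALLS = 10                      # bulkhead size (assumed value)
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT_CALLS)
_bulkhead = threading.BoundedSemaphore(MAX_CONCURRENT_CALLS)

def call_backend_with_fallback(request_fn, fallback, timeout_seconds=1.0):
    """Run request_fn in a bounded pool; on saturation, timeout, or error, degrade."""
    if not _bulkhead.acquire(blocking=False):
        return fallback                        # bulkhead full: fail fast and degrade
    future = _pool.submit(request_fn)
    future.add_done_callback(lambda _: _bulkhead.release())
    try:
        # The caller can "walk away" if the underlying work is latent.
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback                        # latent call: fall back gracefully
    except Exception:
        return fallback                        # backend error: fall back gracefully
```

A caller might wrap a recommendations lookup this way, for example, and always get an answer within the timeout, whether it is fresh data or the degraded default.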
Isolating functionality rather than the transport layer is valuable as it not only extends the bulkhead beyond network failures and latency, but also those caused by client code. Examples include request validation logic, conditional routing to different or multiple backends, request serialization, response deserialization, response validation, and decoration. Network responses can be latent, corrupted, or incompatibly changed at any time, which in turn can result in unexpected failures in this application logic.

Ops Mythology
Velocity 2013 Speaker Series
At some point, we’ve all ended up trading horror stories over drinks with colleagues. Heads nod and shake in sympathy, and the stories get hairier as the night goes on. And while it of course feels good to get some of that dirt off your shoulder, is there a larger, better purpose to sharing war stories? I sat down with James Turnbull of Puppet Labs (@kartar) to chat about his upcoming Velocity talk about Ops mythology, and how we might be able to turn our tales of disaster into triumph.
Key highlights of our discussion include:
- Why do we share disaster stories? What is the attraction? [Discussed at 0:40]
- Stories are about shared experience and bonding with members of our community. [Discussed at 2:10]
- These horror stories are like mythological “big warnings” that help enforce social order, which isn’t always a good thing. [Discussed at 4:18]
- A preview of how his talk will be about moving away from the bad stories so people can keep telling more good stories. (Also: s’mores.) [Discussed at 7:15]
You can watch the entire interview here:
This is one of a series of posts related to the upcoming Velocity conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.

What Is the Risk That Amazon Will Go Down (Again)?
Velocity 2013 Speaker Series
Why should we bother at all with notions such as risk and safety in web operations? Do web operations face risk? Do web operations manage risk? Do web operations produce risk? Last Christmas Eve, Amazon had an AWS outage affecting a variety of actors, including Netflix, a service that was part of many of the gifts shared on that very day. The event introduced the notion of risk into the discourse of web operations, so this may be good timing for some reflective thoughts on the very nature of risk in this domain.
What is risk? The question is a classic one, and the answer is tightly coupled to how one views the nature of the incident occurring as a result of the risk.
One approach to assessing the risk of Amazon going down is probabilistic: start by laying out the entire space of potential scenarios leading to Amazon going down, calculate the probability of each, and multiply each scenario’s probability by its estimated severity (likely expressed as the costs connected to that specific scenario, depending on the time of the event). Each scenario can then be plotted in a risk matrix showing its weighted ranking (to prioritize future risk mitigation measures), or the scenarios can be summed into a collective risk figure (to judge whether the risk of Amazon going down is below a certain acceptance criterion).
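In its simplest toy form, that calculation is nothing more than a sum of probability-weighted severities. The sketch below is purely illustrative; the scenarios, probabilities, and cost figures are hypothetical placeholders, not Amazon data.

```python
# Toy illustration of the probabilistic risk calculation described above.
# Scenarios, probabilities, and cost figures are hypothetical placeholders.
scenarios = {
    "power failure in one data center": (0.02, 200_000),   # (annual probability, cost in $)
    "operator misconfiguration":        (0.05, 500_000),
    "regional network partition":       (0.01, 1_000_000),
}

# Weighted risk per scenario (for ranking in a risk matrix) ...
for name, (probability, cost) in scenarios.items():
    print(f"{name}: expected cost {probability * cost:,.0f}")

# ... and the collective sum, to compare against an acceptance criterion.
total_risk = sum(p * cost for p, cost in scenarios.values())
print(f"total expected cost: {total_risk:,.0f}")
```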
This first way of answering the question of what the risk is for Amazon to go down is intimately linked with a perception of risk as energy to be kept contained (Haddon, 1980). This view originates in the more recent growth of process industries, in which clearly graspable energies (fuel rods at nuclear plants, the fossil fuels at refineries, the kinetic energy of an aircraft) are to be kept contained and safely separated from a vulnerable target such as human beings. The next question of importance then becomes how to avoid an uncontrolled release of the contained energy. The strategies for mitigating the risk of an uncontrolled release of energy are basically two: barriers and redundancy (and the two combined: redundancy of barriers). Physically graspable energies can be contained through the use of multiple barriers (called “defenses in depth”) and potentially several barriers of the same kind (redundancy), for instance several emergency-cooling systems for a nuclear plant.
Using this metaphor, the risk of Amazon going down is mitigated by building a system of redundant barriers (several server centers, backup, active fire extinguishing, etc.). This might seem like a tidy solution, but here we run into two problems with this probabilistic approach to risk: the view of the human operating the system and the increased complexity that comes as a result of introducing more and more barriers.
Controlling risk by analyzing the complete space of possible (and graspable) scenarios basically does not distinguish between safety and reliability. From this view, a system is safe when it is reliable, and the reliability of each barrier can be calculated. However, there is one system component that is more difficult to grasp in terms of reliability than any other: the human. Inevitably, proponents of the energy/barrier model of risk end up explaining incidents (typically accidents) in terms of unreliable human beings not guaranteeing the safety (reliability) of the inherently safe (risk controlled by reliable barriers) system. I think this problem—which has its own entire literature connected to it—is too big to outline in further detail in this blog post, but let me point you towards a few references: Dekker, 2005; Dekker, 2006; Woods, Dekker, Cook, Johannesen & Sarter, 2009. The only issue is that these (and most other citations in this post) are all academic tomes, so for those who would prefer a shorter summary available online, I can refer you to this report. I can also reassure you that I will get back to this issue in my keynote speech at the Velocity conference next month. To put the critique briefly: the contemporary literature questions the view of humans as the unreliable component of inherently safe systems, and instead advocates a view of humans as the only ones guaranteeing safety in inherently complex and risky environments.

Beyond Puppet and Chef: Managing PostgreSQL with Ansible
Velocity 2013 Speaker Series
Think configuration management is simply a decision between Chef or Puppet? PalaminoDB CTO (and Lead DB Engineer for Obama’s 2012 campaign) Jay Edwards (@meangrape) discusses his upcoming Velocity talk about Ansible, an alternative configuration management offering that is quick and easy to start using.
Key highlights include:
- Unlike Puppet or Chef, Ansible has no notion of a centralized server. [Discussed at 1:30]
- Ansible lets you get started more quickly and easily by doing everything via SSH. [Discussed at 2:12]
- It’s also good for small-scale projects, such as home or personal things where no persistent state is required. [Discussed at 2:47]
- Configuration in Ansible is all handled via markup in YAML files, so no domain-specific languages (DSL) or Ruby knowledge is required. [Discussed at 3:30]
- Ansible is easily extensible in any language (not just Ruby). [Discussed at 4:50]
- While it’s less relevant for someone with existing configuration management installations, Ansible could be useful in certain cases, such as Puppet without mcollective set up. [Discussed at 6:11]
You can watch the entire interview here:
This is one of a series of posts related to the upcoming Velocity conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.

End-to-End JavaScript Quality Analysis
Velocity 2013 Speaker Series
The rise of single-page web applications means that front-end developers need to pay attention not only to network transport optimization, but also to rendering and computation performance. With applications written in JavaScript, the language tooling itself has not really caught up with the demand for the richer, more varied performance metrics that such a development workflow requires. Fortunately, some emerging tools can serve as a stop-gap measure until the browser itself provides native support for those metrics. I’ll be covering a number of them in my talk at Velocity next month, but here’s a quick sneak preview of a few.
Code coverage
One important practice for managing overall single-page application performance is instrumenting the application code. The most obvious use case is analyzing code coverage, particularly when running unit tests and functional tests. Code that never gets executed during the testing process is an accident waiting to happen. While it is unreasonable to demand 100% coverage, having no coverage data at all does not provide a lot of confidence. These days, easy-to-use coverage tools such as Istanbul and Blanket.js are becoming widespread, and they work seamlessly with popular test frameworks such as Jasmine, Mocha, Karma, and many others.
Complexity
Instrumented code can be leveraged to perform another type of analysis: run-time scalability. Performance is often measured by elapsed time, e.g. how long it takes to perform a certain operation. This stopwatch approach only tells half of the story. For example, testing that sorting 10 contacts in an address book application takes 10 ms doesn’t tell you anything about the complexity of that address book. How will it cope with 100 contacts? 1,000 contacts? Since it is not always practical to carry out a formal analysis of the application code to figure out its complexity, the workaround is to measure its empirical run-time complexity. In this example, that can be done by instrumenting and monitoring a particular part of the sorting implementation—probably the “swap two entries” function—and watching its behavior with different input sizes.
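The talk’s context is JavaScript, but the idea itself is language-agnostic. Here is a small illustrative sketch in Python (the sort algorithm and the input sizes are arbitrary choices, not from the talk) that instruments a “swap two entries” function and records how often it runs as the input grows.

```python
# Language-agnostic sketch of empirical run-time complexity: instrument the
# "swap two entries" function and watch how often it runs as input grows.
import random

swap_count = 0

def swap(items, i, j):
    global swap_count
    swap_count += 1
    items[i], items[j] = items[j], items[i]

def bubble_sort(items):
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i] > items[i + 1]:
                swap(items, i, i + 1)

for size in (10, 100, 1000):
    swap_count = 0
    contacts = [random.random() for _ in range(size)]
    bubble_sort(contacts)
    # Roughly quadratic growth in swaps reveals the empirical complexity.
    print(f"n={size}: {swap_count} swaps")
```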
As JavaScript applications are getting more and more complex, some steps are necessary to keep the code as readable and as understandable as possible. With a tool like JSComplexity, code complexity metrics can be obtained in static analysis steps. Even better, you can track both McCabe’s cyclomatic complexity and Halstead complexity measures of every function over time. This prevents accidental code changes that could be adding more complexity to the code. For the application dashboard or continuous integration panel, these complexity metrics can be visualized using Plato in a few easy steps.

Doug Hanks on how the MX series is changing the game
Doug Hanks (@douglashanksjr) is an O’Reilly author (Juniper MX Series) and a data center architect at Juniper Networks. He is currently working on one of Juniper’s most popular devices – the MX Series. The MX is a routing device that’s optimized for delivering high-density and high-speed Layer 2 and Layer 3 Ethernet services. As you watch the video interview embedded in this post, the data is more than likely being transmitted across the Juniper MX.
We recently sat down to discuss the MX Series and the opportunities it presents. Highlights from our conversation include:
- MX is one of Juniper’s best-selling platforms [Discussed at the 0:32 mark].
- Learn if the MX can help you [Discussed at the 1:00 mark].
- What you need to know before using the MX [Discussed at the 6:40 mark].
- What’s next for Juniper [Discussed at the 9:39 mark].
You can view the entire interview in the following video.