Planet Python
Last update: May 15, 2008 02:40 AM
May 14, 2008
EuroPython Conference
Registration for EuroPython 2008 is Now Open!
At last, registration for EuroPython 2008 is now open! Take a look at the registration page on the EuroPython Web site for full details:
https://www.europython.org/community/Registration
The usual generous discount on fees is offered for registrations made up until 31st May. Don’t forget to check the calendar for other important deadlines:
https://www.europython.org/community/Calendar
We look forward to seeing you in Vilnius!

Tarek Ziade
Google AppEngine sprint at Pycon FR
Late (but exciting) news: a sprint room will be hosted at Pycon FR by the Logilab team, which has published a GPL framework on top of Google AppEngine called Logilab Appengine eXtension (lax).
Participants will be able to learn how to play with AppEngine technologies.
All the details (in French): https://fr.pycon.org/programme/sprint-appengine

ivan krstić · code culture
Sic Transit Gloria Laptopi

Photo: Walter shows me improvements to the Record activity at the Lima coastline, Peru.
I’ve been displeased with the quality of community discourse surrounding the recent OLPC announcement of moving to Windows as the OS platform. I decided to withhold comment at the time, and was swayed only by the half-dozen volunteers mailing me personally to ask whether all their work had been in vain. It hadn’t. And then I left to travel for a few days.
I just caught up with my mail and RSS feeds, and what I’ve read has moved me from displeased to angry. So I’m going to comment after all, and it’ll be my last OLPC-related essay for the foreseeable future. But first, some background.
The beginning
Throughout his life, Nicholas Negroponte worked with education and technology luminaries like Alan Kay and Seymour Papert. In the early 80s, Nicholas and Seymour ran a pilot program backed by the French government that placed Apple ][ machines in a suburban computing center in Dakar, Senegal. The project was a spectacular flop due to mismanagement and personality conflicts. In '83, barely a year after the experiment started, MIT's Technology Review magazine published its damning epitaph:
Naturally, it failed. Nothing is that independent, especially an organization backed by a socialist government and staffed by highly individualistic industry visionaries from around the world. Besides, altruism has a credibility problem in an industry that thrives on intense commercial competition.
By the end of the Center's first year, Papert had quit, so had American experts Nicholas Negroponte and Bob Lawler. It had become a battlefield, scarred by clashes of management style, personality, and political conviction. It never really recovered. The new French government has done the Center a favor in closing it down.
But both Nicholas and Seymour emerged from the ashes of the Dakar pilot with their faith in the premise of children learning naturally with computers intact. Armed with the lessons from the Senegal failure, it was perhaps only a matter of time before they tried again.
Indeed, Seymour tried again only a couple of years later: the Media Lab was founded in 1985 and immediately started supporting Project Headlight, an attempt to infuse constructionist learning into the complete curriculum of the Hennigan school, a public elementary school in Boston consisting mostly of minority students.
Fast forward almost two decades, to around 2000. Former Newsweek foreign correspondent turned philanthropist Bernie "one-man United Nations" Krisher convinced Nicholas and his wife Elaine to join Bernie's program of building schools in Cambodia. Nicholas bought used Panasonic Toughbooks for one school, and his son Dimitri taught there for a time.
"Surely," the thinking went, "there has to be a way to scale this." And the rest of the story is familiar: Nicholas wooed Mary Lou Jepsen while she was interviewing for a faculty position at the Lab, and told her about his crazy idea for an organization called One Laptop per Child. She came on board as CTO. Towards the end of 2005, the organization left stealth mode with a bang: Nicholas announced it with Kofi Annan, Nobel Peace Prize winner and then-Secretary-General of the United Nations, at a global summit in Tunis.
The part that bears repeating is that Nicholas' constructionism-based computer learning project in Senegal was a complete disaster: modulo commentary on the personalities and egos involved, it demonstrated nothing about anything. And Krisher's Cambodia project, the one evidently successful enough to motivate Nicholas to actually start OLPC, used off-the-shelf laptops running Windows without any constructivist customizations of the OS whatsoever. (They did have some constructionist tools installed as regular applications.)
What we know
The truth is, when it comes to large-scale one-to-one computing programs, we're completely in the dark about what actually works, because hey, no one has done a large-scale one-to-one computing program before. Mako Hill writes:
We know that laptop recipients will benefit from being able to fix, improve, and translate the software on their laptops into their own languages and contexts. ... We can help foster a world where technology is under the control of its users, and where learning is under the terms of its students — a world where every laptop owner has freedom through control over the technology they use to communicate, collaborate, create, and learn. It is the reason that OLPC's embrace of constructionist philosophy is so deeply important to its mission and the reason that its mission needs to continue to be executed with free and open source software. It is why OLPC needs to be uncompromising about software freedom.
This kind of bright-eyed idealism is appealing, but alas, just not backed by fact. No, we don't know that laptop recipients will benefit from fixing software on their laptops. Indeed, I bet they'd largely prefer the damn software works and doesn't need fixing. While we think and even hope that constructionist principles, as embodied in the free software culture, are helpful to education, presenting the hopes as rooted in fact is simply deceitful.
As far as I know, there is no real study anywhere that demonstrates constructionism works at scale. There is no documented moderate-scale constructionist learning pilot that has been convincingly successful; when Nicholas points to "decades of work by Seymour Papert, Alan Kay, and Jean Piaget", he's talking about theory. He likes to mention Dakar, but doesn't like to mention how that pilot ended — or that no real facts about the validity of the approach came out of it. And there sure as hell doesn't exist a peer-reviewed study (or any other kind, to my knowledge) showing free software does any better than proprietary software when it comes to aiding learning, or that children prefer the openness, or that they care about software freedom one bit.
Keeping that in mind, Richard Stallman's missive on the subject just riled me up:
Proprietary software keeps users divided and helpless. Its functioning is secret, so it is incompatible with the spirit of learning. Teaching children to use a proprietary (non-free) system such as Windows does not make the world a better place, because it puts them under the power of the system's developer — perhaps permanently. You might as well introduce the children to an addictive drug.
Oh, for fuck's sake. You really just employed a simile comparing a proprietary OS to addictive drugs? You know, ones causing actual bodily harm and possibly death? Really, Stallman? Really?
If proprietary software is half as good as free software at aiding children's learning, you're damn right it makes the world a better place to get the software out to children. Hell, if it doesn't actively inhibit learning, it makes the world a better place. The problem is that Stallman doesn't appear to actually give an acrobatic shit about learning, and sees OLPC as a vehicle for furthering his political agenda. It's shameful, the lot of it.
While we're on the subject
One of the favorite arguments of the free software and open source community for the obvious superiority of such software over proprietary alternatives is the user's supposed ability to take control and modify inadequate software to suit their wishes. As expected, the argument has often been repeated in relation to OLPC.
I can't possibly be the only one seeing that the emperor has no clothes.
I started using Linux in '95, before most of today's Internet-using general public knew there existed an OS outside of Windows. It took a week to configure X to work with my graphics card, and I learned serious programming because I later needed to add support for a SCSI hard drive that wasn't recognized properly. (Not knowing that C and kernel hacking are supposed to be "hard", I kept at it for three months until I learned enough to write a patch that works.) I've been primarily a UNIX user since then, alternating between Debian, FreeBSD and later Ubuntu, and recently co-writing a best-selling Linux book.
About eight months ago, when I caught myself fighting yet another battle with suspend/resume on my Linux-running laptop, I got so furious that I went to the nearest Apple store and bought a MacBook. After 12 years of almost exclusive use of free software, I switched to Mac OS X. And you know, shitty power management and many other hassles aren't Linux's fault. The fault lies with needlessly secretive vendors not releasing documentation that would make it possible for Linux to play well with their hardware. But until the day comes when hardware vendors and free software developers find themselves holding hands and spontaneously bursting into one giant orgiastic Kumbaya, that's the world we live in. So in the meantime, I switched to OS X and find it to be an overwhelmingly more enjoyable computing experience. I still have my free software UNIX shell, my free software programming language, my free software ports system, my free software editor, and I run a bunch of free software Linux virtual machines.
The vast, near-total majority of computer users aren't programmers. Of the programmers, a vast, near-total majority don't dare in the Land o' Kernel tread. As one of the people who actually can hack my kernel to suit, I find that I don't miss the ability in the least. There, I said it. Hang me for treason.
My theory is that technical people, especially when younger, get a particular thrill out of dicking around with their software. Much like case modders, these folks see it as a badge of honor that they spent countless hours compiling and configuring their software to oblivion. Hey, I was there too. And the older I get, the more I want things to work out of the box. Ubuntu is getting better at delivering that experience for novice users. Serious power users seem to find that OS X is unrivaled at it.
I used to think that there was something wrong with me for thinking this. Then I started looking at the mail headers on mailing lists where I hang out, curious about what other folks I respect were using. It looks like most of the luminaries in the security community, one of the most hardcore technical communities on the planet, use OS X.
And lest you think this is some kind of Apple-paid rant, I'll mention Mitch Bradley. Have you read the story of Mel, the "real" programmer? Mitch is that guy, in 2008. Firmware superhacker, author of the IEEE Open Firmware standard, wrote the firmware that Sun shipped on its machines for a good couple of decades, and in general one of the few people I've ever had the pleasure of working with whose technical competence so inordinately exceeds mine that I feel I wouldn't even know how to start catching up. Mitch's primary laptop runs Windows.
Sleight of hand
But really, I digress. The point is that OLPC was supposed to be about learning, not free software. And the most upsetting part of the Windows announcement is not that it exposed the actual agendas of a number of project participants which had nothing to do with learning, but that Nicholas' misdirection and sleight of hand were allowed to stand.
The whole "we're investing into Sugar, it'll just run on Windows" gambit is sheer nonsense. Nicholas knows quite well that Sugar won't magically become better simply by virtue of running on Windows rather than Linux. In reality, Nicholas wants to ship plain XP desktops. He's told me so. That he might possibly fund a Sugar effort to the side and pay lip service to the notion of its "availability" as an option to purchasing countries is at best a tepid effort to avert a PR disaster.
In fact, I quit when Nicholas told me — and not just me — that learning was never part of the mission. The mission was, in his mind, always getting as many laptops as possible out there; to say anything about learning would be presumptuous, and so he doesn't want OLPC to have a software team, a hardware team, or a deployment team going forward.
Yeah, I'm not sure what that leaves either.
There are three key problems in one-to-one computer programs: choosing a suitable device, getting it to children, and using it to create sustainable learning and teaching experiences. They're listed in order of exponentially increasing difficulty.
The industry didn't want to tackle the first one because there was little profit in it. OLPC successfully made them do it in the most effective way possible: by threatening to steal their lunch. But industry laptop manufacturers still don't want to tackle deployment, because it's really, really fucking hard, isn't within a 100-mile radius of their core competency, and generally has a commercial ROI that makes baby Cthulhu cry.
Peru's first deployment module consisted of 40 thousand laptops, to be deployed in about 570 schools across jungles, mountains, plains, and with total variance in electrical availability and uniformly no existing network infrastructure. A number of the target schools are in places requiring multiple modes of transportation to reach, and that are so remote that they're not even serviced by the postal service. Laptop delivery was going to be performed by untrusted vendors who are in a position to steal the machines en masse. There is no easy way to collect manifests of what actually got delivered, where, and to whom. It's not clear how to establish a procedure for dealing with malfunctioning units, or those dead on arrival. Compared to dealing with this, the technical work I do is vacation.
Other than the incredible Carla Gomez-Monroy who worked on setting up the pilots, there was no one hired to work on deployment while I was at OLPC, with Uruguay's and Peru's combined 360,000 laptop rollout in progress. I was parachuted in as the sole OLPC person to deal with Uruguay, and sent to Peru at the last minute. And I'm really good at thinking on my feet, but what the shit do I know about deployment? Right around that time, Walter was demoted and theoretically made the "director of deployment," a position where he directed his expansive team of — himself. Then he left, and get this: now the company has half a million laptops in the wild, with no one even pretending to be officially in charge of deployment. "I quit," Walter told me on the phone after leaving, "because I can't continue to work on a lie."
It's not like OLPC was caught unawares, or somehow forgot that this was going to be an issue. I wrote in an internal memo in December:
We have multiple concurrent rollouts of differing scale in progress — Uruguay with eight thousand machines, G1G1 with potentially a quarter million — and with at least Peru and Mongolia on the horizon within a month from now. We have no real support infrastructure for these rollouts, our development process is not allocating any time for dealing with critical deployment issues that (will inevitably) come up, and we have no process for managing the crises that will ensue. I wish I could say this is the bulk of our problems, but I mention these first simply because I predict it's these deployments that will impose the heaviest burden on this organization in the coming months — a burden we're presently entirely unprepared to handle.
...
We still have not a single employee focusing on deployment, helping to plan it, working with our target countries to learn what works and what doesn't. Evidently our "deployment plan" is to send whichever hotshot superhacker we have available to each country such that he may fix any problems that arise on the spot. If that is not in fact our plan, then we have no plan at all.
That OLPC was never serious about solving deployment, and that it seems to no longer be interested in even trying, is criminal. Left uncorrected, it will turn the project into a historical information technology fuckup unparalleled in scale.
As for the last key problem, transforming laptops into learning is a non-trivial leap of logic, and one that remains inadequately explained. No, we don't know that it'll work, especially not without teachers. And that's okay — the way to find out whether it works might well be by trying. Sometimes you have to run before you can walk, yeah? But most of us who joined OLPC believed that the educational ideology behind the project is what actually set it apart from similar endeavors in the past. Learning which is open, collaborative, shared, and exploratory — we thought that's what could make OLPC work. Because people have tried plain laptop learning projects in the past, and as the New York Times noted on its front page not so long ago, they crashed and burned.
Nicholas' new OLPC is dropping those pesky education goals from the mission and turning itself into a 50-person nonprofit laptop manufacturer, competing with Lenovo, Dell, Apple, Asus, HP and Intel on their home turf, and by using the one strategy we know doesn't work. But hey, I guess they'll sell more laptops that way.
Broken windows theory
I've tried to establish already that there's no evidence that free software provides a superior learning experience when compared to a proprietary operating system. This point bears some elaboration. Bernie Innocenti, until recently the CTO for the fledgling OLPC Europe, a few days ago wrote:
I myself wouldn't oppose a Windows port of Sugar. I would never waste my time on it, or encourage anyone to waste their time on it, but it's free software and thus anyone is free to port it to anything they wish.
Stallman similarly called a Windows port of Sugar "not a good thing to do". Here's the thing: such a port is only a waste of time if free software is not the means here, but an end. At Nicholas' solicitation, I wrote an internal memo on software strategy in early March. It was co-signed by Marco Pesenti Gritti, the inimitable Sugar team lead. I am not at liberty to reproduce the entire document, but I will quote the most relevant section with minimal redactions:
... We [argue strongly that we should] decouple the Sugar UI from the Sugar technologies we’ve developed such as sharing, collaboration, the presence service, the data store, and so forth. We may then make those services run well in a regular Linux desktop environment and redefine the Sugar activity concept to simply be any Linux desktop application capable of using the Sugar services. The Sugar UI itself could, optionally and at a later date, be provided as a graphical launcher, perhaps developed by the community.
The core mistake of the present Sugar approach is that it couples phenomenally powerful ideas about learning — that it should be shared, collaborative, peer to peer, and open — with the notion that these ideas must come presented in an entirely new graphical paradigm. We reject this coupling as untenable.
Choosing to reinvent the desktop UI paradigm means we are spending our extremely overconstrained resources fighting graphical interfaces, not developing better tools for learning. … It is most important to recognize that the graphical paradigm changes are inessential both to our core mission and to the Sugar core ideas.
We gain a plethora of benefits from detaching the technologies that directly support the mode of learning we care about from the Sugar UI. Notably, it becomes far easier to spread these ideas and technologies across platforms — our UI components are the hardest parts to port. If the underlying Sugar technologies were made easily available for all major OSes, we could leverage the creativity and work of the wider development community to build applications on top of our core offerings, creating a diverse ecosystem of powerful learning tools. Those tools could then be used by learners globally and on any computer, XO or otherwise. This should have been our aim all along. Many of the technologies we’ve built would be welcomed with arms wide open into modern Linux desktops, and a large number of developers would likely get engaged with them if we provided the possibility. In contrast to the current situation, such a model must be the direction where we take things: OLPC benevolently steering development which is mostly done by the community.
Finally, with regard to the politically-sensitive question of OLPC’s commitment to open source, we think there is a simple and uncomplicated answer: OLPC’s policy should be that all OLPC-developed software is open source and uses open standards and open formats. We don’t think a stronger commitment is necessary. Our preference for open source should stem solely from the conviction that it provides a better learning environment than closed-source alternatives. As such, having an open source cross-platform set of core technologies for building collaborative learning applications makes a tremendous amount of sense. But fundamentally, requiring that a particular UI or even OS are used seems entirely superfluous; we should be satisfied with any environment where our core technologies can be used as building blocks for delivering the learning experience we care so strongly about.
At the end of the day, it just doesn’t matter to the educational mission what kernel is running Sugar. If Sugar itself remains open and free — which, thus far, has never been in question — all of the relevant functionality such as the ‘view source’ key remains operational, on Windows or not. OLPC should never take steps to willingly limit the audience for its learning software. Windows is the most widely used operating system in existence. A Windows-compatible Sugar would bring its rich learning vision to potentially tens or hundreds of millions of children all over the world whose parents already own a Windows computer, be it laptop or desktop. To suggest this is a bad course of action because it’s philosophically impure is downright evil.
And hey, maybe a Windows version of Sugar gets kids sufficiently interested in computer innards to actually want to switch to Linux. Trolltech, the company behind the Qt graphical toolkit, was recently purchased by Nokia and announced it’ll be adding platform support for the mobile version of Windows, apparently to accusations of treason in the free software community. But Trolltech’s CTO Benoit Schillings doesn’t think that’s right:
Some critics are concerned that Trolltech’s support for Windows Mobile could limit the growth of mobile and embedded Linux technologies, but Schillings sees things differently. By enabling application developers to create a single code base that can seamlessly move across platforms, he says that Trolltech is making it easier for companies that are currently using Windows Mobile to transition to Linux, which he thinks will mean more adoption of the open source operating system in the long run.
The man speaks wisely.
Now, pay close attention: while I’m unequivocally enthusiastic about Sugar being ported to every OS out there, I’m absolutely opposed to Windows as the single OS that OLPC offers for the XO. The two matters are completely orthogonal, and Nicholas’ attempt to conflate them by calling the open source community “fundamentalists” (and watching the community foam at the mouth instead of picking apart his logic) is just another bit of misdirection. Not that anyone should really feel offended, since he’s made it a habit to call his employees terrorists.
OLPC should be philosophically pure about its own machines. Being a non-profit that leverages goodwill from a tremendous number of community volunteers for its success and whose core mission is one of social betterment, it has a great deal of social responsibility. It should not become a vehicle for creating economic incentives for a particular vendor. It should not believe the nonsense about Windows being a requirement for business after the children grow up. Windows is a requirement because enough people grew up with it, not the other way around. If OLPC made a billion people grow up with Linux, Linux would be just dandy for business. And OLPC shouldn’t make its sole OS one that cripples the very hardware that supposedly set the project’s laptops apart: released versions of Windows can neither make good use of the XO power management, nor its full mesh or advanced display capabilities.
Most importantly, the OS that OLPC ships should be one that embodies the culture of learning that OLPC adheres to. The culture of open inquiry, diverse cooperative work, of freely doing and debugging — this is important. OLPC has a responsibility to spread the culture of freedom and ideas that support its educational mission; that cannot be done by only offering a proprietary operating system for the laptops.
Put differently, OLPC can’t claim to be preoccupied with learning and not with training children to be office computer drones, while at the same time being coerced by hollow office drone rhetoric to deploy the computers with office drone software. Nicholas used to say the thought of the XOs being used to teach 6-year-olds Word and Excel made him cringe. Apparently, no longer so. Which is it? The vacillation needs to stop. As they say in the motherland: shit or get off the pot.
How to go forward
Here’s a paragraph from one of my last e-mails to Nicholas, sent shortly after I resigned:
I continue to think it’s a crying shame you’re not taking advantage of how OLPC is positioned. Now that it’s goaded the industry into working on low-cost laptops, OLPC could become a focus point for advocating constructionism, making educational content available, providing learning software, and keeping track of worldwide [one-to-one] deployments and the lessons arising from them. When a country chooses to do [a one-to-one computer program], OLPC could be the one-stop shop that actually works with them to make it happen, regardless of which laptop manufacturer is chosen, banking on the deployment plans it’s cultivated from experience and the readily available base of software and content it keeps. In other words, OLPC could be the IBM Global Services of one-to-one laptop programs. This, I maintain, is the right way to go forward.
I’m trying to convince Walter not to start a Sugar Foundation, but an Open Learning Foundation. For those who still care about learning in this whole clusterfuck of conflicting agendas, the charge should be to start that organization, since OLPC doesn’t want to be it. Having a company that is device-agnostic and focuses entirely on the learning ecosystem, from deployment to content to Sugar, is not only what I think is sorely needed to really take the one-to-one computer efforts to the next level, but also an approach that has a good chance of making the organization doing the work self-sustaining at some point.
So here’s to open learning, to free software, to strength of personal conviction, and to having enough damn humility to remember that the goal is bringing learning to a billion children across the globe. The billion waiting for us to put our idiotic trifles aside, end our endless yapping, and get to it already.
Let’s get to it already.
My thanks to Walter Bender and Marco P. Gritti for reading drafts of this essay.
Python Secret Weblog
In Memoriam Joachim Schmitz
This is a rather unusual blog entry. I sincerely hope I don't have to write blog entries like this very often. I usually write about technology, and the community surrounding it. This community is, of course, composed of people. Yesterday I heard that Joachim Schmitz, long-standing member of the Zope and Python communities, had died suddenly last weekend. He was a regular presence at Zope-related events, and I have met him often. I feel that his passing should not go without notice for people in our community who knew him and met him.
I last saw Joachim just a week before he died, at the Grokkerdam sprint here in Rotterdam. Here is a picture of the sprint, winding down on a Saturday night, with people working and playing. Joachim is in the bottom right corner, looking pensively at the laptop in front of us, while I'm explaining some code to him. It's Joachim as I knew him, in his element, surrounded by fellow enthusiasts.
I wrote the following message to his friends and family, and I want to share it with everybody. I wrote it for myself, as someone who knew Joachim and will miss him, and also for our community.
I know Joachim as a member of the Zope community. I want to share a little about what Joachim meant to us in the Zope community.
Zope is software. I hear that he talked a lot about Zope in his private life, so you might have heard of it. He talked about it even to people that don't know much about software. I understand this perfectly, as I have the same habit. Zope, to him, like to me, is more than just a boring tool you use in your work to build web applications. It's something you can be passionate about, like a craftsman can be about his tools. It's also a community: people you know, who you work with, and who you like. It's a community of people that have known each other for years.
I remember when I met Joachim for the first time. It was at the International Python Conference in Washington DC, in the United States, in late January in the year 2000. It feels very long ago to me now; so much has happened since then. I sat next to him during some talks about Zope. I think he had come to the conference for the same reason I had: to meet other people who work with Zope. Joachim was friendly, and we started talking to each other.
We didn't know yet, at this first meeting, that we would see each other again on many occasions afterwards, in the years following. His friends and relatives must have noticed Joachim was frequently away at one Zope-related event or another. He wasn't really away, though: that's when he was with us! He became a regular face at Zope-related events. I saw him several times each year following the first meeting. He was there so often and so reliably that we joked with him: you again! You're always there!
It isn't a proper Zope event without Joachim there.
Joachim contributed to our community in many ways. One example is his hard work for several years to make sure that people could sign up and pay for the EuroPython conference, helping to make this conference a success.
I saw Joachim for the last time only a little over a week ago as I write this. He was participating in the Grokkerdam sprint, here in Rotterdam. It was yet another Zope-related event: a proper one, as Joachim was there! I sat together with Joachim for a while, working with him and talking to him. Joachim was always interested, always learning, always participating. We were talking about how to teach Zope a few new tricks, planning for the future... See you next time, we told each other, when he left for home.
A week later he was gone, suddenly. He was one of us, and well-liked. This is why I feel so sad now, knowing he passed away. I'm glad I got to see him that one last time.
We who knew him will miss him. Zope events, and EuroPython, won't be the same without him. We will have to get used to him not being there in person with us. But we will remember him. As long as we keep thinking of him, he is not entirely lost to us. This way, we will still feel his friendly presence in the future.
Martijn Faassen, Rotterdam, the Netherlands
Chairman of the Zope Foundation
Greg Wilson
On My Way to Texas
I’m flying down to Austin this afternoon (I know, I know, bad carbon karma), where I’ll be talking to the Austin Python Users’ Group about Beautiful Code, and at the Texas Advanced Computing Center’s Scientific Software Days about “HPC Considered Harmful”. I’m looking forward to meeting everyone!
What I’m Reading These Days
A couple of students have asked, so here’s my reading list:
- “ACM Queue”, “Communications of the ACM”, “IEEE Software”, “IEEE Computer”: all are magazines, rather than peer-reviewed research journals; I flip through each one when I find it just to see if there’s anything of interest. Good for broad, high-level overview of what everyone else is thinking about; I guess I read about 0.5 articles per magazine, and spend no more than 2-3 minutes flipping through them on average.
- “Empirical Software Engineering”, “IEEE Transactions on Software Engineering”, “Automated Software Engineering”, “ACM Transactions on Software Engineering and Methodology”, and a few others: the specialized peer-reviewed journals of record in my area. Very low hit rate these days (maybe one article in ten), partly because they cover the whole of software engineering, and partly because most of the things being discussed seem to have little to do with real-world software development as I’ve experienced it.
- “Discover”, “American Scientist”, and “New Scientist”: these are for fun (yeah, I’m a science geek); I have a couple to take with me on the plane to Texas this afternoon. (I’m particularly fond of Brian Hayes’ column in “American Scientist”…)
- “Computing in Science & Engineering”: figuring out how to make scientific programmers more productive is the main reason I’m in academia (see https://swc.scipy.org for my current best guess). I’m on the editorial board of this magazine, and I’d guess I read about 1/4 of the articles end to end.
- “Dr. Dobb’s Journal”: has been talking to professional software developers since the late 1970s. Most of my book reviews appear here, and I find two or three articles in each issue worth reading from end to end. A lot of what I know about real-world technologies I pick up here.
- “Software: Practice & Experience” and “The Journal of Systems & Software”: in-depth descriptions (and critiques) of real software systems (which is what I thought software engineering would mostly be about, back in my naive and idealistic youth). The first description of “Make” appeared in “SP&E” way back in 1979, and a recent issue of “JSS” described a dozen different systems for tracking the provenance of scientific data. High hit rate…
- SIGCSE: is the Special Interest Group on Computer Science Education. They have an annual conference, and I go through the proceedings article by article every year (high hit rate). I’ve also started reading the proceedings from ITiCSE and CSEE&T, which are (respectively) a European equivalent to SIGCSE and a conference on software engineering education and training.
- Adam Goucher’s blog: best tester I ever worked with, now thinking about what QA really ought to be about. I also enjoy the Google Testing Blog.
- The Beautiful Code blog (mostly written by Michael Feathers, author of one of my favorite books): lots of good thoughts on software system design.
- The Computer Science Canada blog: student-run, student-written, interesting viewpoint on the world (always looking for contributions, by the way).
- The DemoCamp blog: DemoCamp is the equivalent of open mike night at the pub; tech people from small companies and startups in the Toronto area get up and give lightning demos and talks about what they’re doing. Since David Crow founded it two and a half years ago, it has spread to more than a dozen other cities.
- Joel Spolsky, Jon Udell, and Bruce Schneier: the first is more often amusing than deep, the second reminds me of John McPhee’s essays, and the third is frankly scary.
- XKCD: the ironic person’s Dilbert.
You can also check out my recommended reading list (slightly out of date — see my LibraryThing page for a more complete list, mostly sans review).
David Ascher Has Nice Things To Say…
…about the work that Mike Wu and Ronald Fung did for Thunderbird last term. Feedback from other clients was equally positive; I’m hoping/looking forward to running the course again next year.
A Different Perspective
Interesting post from Jorge Aranda, reporting a talk from 1969 that still resonates.
IPython0 blog
Greedy completer
Every now and again, we at IPython0 get complaints about the strict criteria that we use for tab completing python attributes; basically, we only tab complete expressions where side effects are not likely when calling ‘eval’, as in expression foo.bar.ba<TAB> (we eval foo.bar, then get attributes from the resulting object). An expression we do not [...]
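In plain Python, the strict approach amounts to something like this rough sketch (not IPython's actual code; the function name is made up for illustration):

    def attr_matches(text, namespace):
        # Complete "foo.bar.ba<TAB>": evaluate everything before the
        # last dot, then list matching attributes of the result.
        expr, dot, attr_prefix = text.rpartition('.')
        if not dot:
            return []
        obj = eval(expr, namespace)
        return ['%s.%s' % (expr, name) for name in dir(obj)
                if name.startswith(attr_prefix)]

Calling eval here is harmless as long as expr is a plain dotted name; a "greedy" completer that accepted arbitrary expressions, say foo.bar(x).ba<TAB>, would execute foo.bar(x) during the eval, and that call may have side effects.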
Ted Leung
JavaOne 2008: Part 2
I’ve been to so many conferences and seen so many talks that it’s hard for me to really get excited about conference presentations. I went to talks here and there, but nothing at JavaOne was really reaching out and grabbing me (in fairness, this happens at other conferences also, so it’s not just JavaOne). Or at least that was true until the last day.
Friday opened with a keynote by James Gosling, who served as the MC for a train of presenters on various cool projects.
Cool stuff
First up was Tor Norbye, who has done a lot of good work on support for editing different languages in NetBeans. Tor has been working on JavaScript support for NetBeans 6.1, and he showed off some cool features, like detecting all the exits from a function, semantic highlighting of variables, and integrated debugging between NetBeans and Firefox. All of which was cool. When I was managing the Cosmo group at OSAF, I tried a bunch of Javascript IDE’s and never really liked any of them. I haven’t done a lot with NetBeans 6.1 yet, but I will. Tor showed one feature, which was the killer one for me. NetBeans knows what Javascript will work in which browser. You can configure the IDE for the browsers that you want to support, and this affects code completion, quick fix checking and so on. Definitely useful. Here are several more references on the Javascript support in NetBeans 6.1.
The Java Platform
It’s easy for me (and others, I’d bet) to think mostly of JavaEE or perhaps JavaME when thinking about Java. That’s understandable given the world’s fixation on web applications, and looking ahead to mobile. But the majority of the talks in Gosling’s keynote session had nothing to do with Java SE, EE, or ME (at least in the phone sense).
Probably the hit (applause meter wise) of the keynote was LiveScribe’s demonstration of their Pulse Smart Pen. This is an interesting pen that records the ink strokes that it makes, and any ambient audio that it records while the writing is happening. The ink and audio can be uploaded to a computer, as long as that computer runs Windows (apparently a Mac version is in the works). Unfortunately, the pen works by sensing marks on a special paper (that would be the razor blades), so there’s a limitation on how useful this can be. The presenter said that a future version of the software would allow people to print their own special paper, but that’s still a future item for now. By reading special marks on the special paper, you get a pretty cool user interface. The pen itself can run Java programs, and there is a developer kit available for it. If they can get by the limitation of special paper, I think that this is going to be pretty interesting.
Sentilla showed off their Mote hardware, which seems like RFID chips that can run Java programs, except that these chips can form mesh networks amongst themselves and can have various kinds of sensors attached. There are lots of applications for these things, going well beyond inventory tracking and such.
Sun Distinguished Engineer Greg Bollella demonstrated Blue Wonder, which is a replacement for the computers used to control factories. Blue Wonder combines off the shelf x86 hardware, Solaris, and real time Java to provide a commodity solution for factory control applications. This is far afield of Web 2.0 applications, but just as cool, in my mind.
By the end of the keynote I was reminded of the long reach of the JVM platform, something that I’d lost sight of. The latest craze in the Web 2.0 space is location data — O’Reilly has an entire conference devoted to the topic. I think that sensor fusion of various kinds (not just location sensors) is going to play a big role in the next generation of really interesting applications. The JVM looks like it’s going to be a part of that. I don’t think that any other virtual machine technology is close in this regard.
Java’s future
I also went to a talk on Maxine, a meta-circular JVM. By the twitter reactions of the JRuby and Jython committers, I’d say that Maxine is going to get some well-deserved attention when it is open sourced in June. I’m particularly interested because the PIs for Maxine worked on PJava and MVM. Given the differences between the Erlang VM and the JVM, I think that the ability to experiment with MVM is going to be pretty interesting. Apparently, there’s already some form of MVM support in Maxine; we’ll find out for sure in June.
During the conference I had a meeting with Cay Horstmann, and at the end of the meeting Josh Bloch saw Cay and wanted to talk to him about the BGGA closures proposal for Java. Turns out that Josh has an entire slide deck which consists of a stream of examples where BGGA does the wrong thing, generates really cryptic error messages, or requires an unbelievable amount of code. The fact that BGGA depends on generics, which are already really hard, doesn’t give me much confidence about closures in Java. If you are a statically typed language fan, I think that you ought to be worried about whether Java, the language, has any headroom left.
The last session that I went to was Cliff Click and Brian Goetz’s session on concurrency. Unsurprisingly, the summary of the talk is “abandon all hope, ye who enter here”. I was glad to see a section in the talk about hardware support/changes for concurrency. The problem is that concurrency is going to introduce end-to-end problems, from the hardware all the way up to the application level, and I think that every stop along the way is going to be affected. Unlike sequential programming, where we are still largely reinventing the wheels of the past, concurrency has no real history of research results to be mined. Hotspot and other VM’s are close to implementing most of the tricks learned from Smalltalk and Lisp, but those systems were mostly used in a sequential fashion, and while there were experiments with concurrency, there was much less experience with the concurrent systems than the sequential ones. Big challenges ahead.
JavaOne 2008: Part 1
JavaOne is a pretty intense experience, simply by virtue of the size. If CommunityOne was twice the size of OSCON, then JavaOne is three times the size of OSCON, and it shows. There was an immediate change in feel and atmosphere once JavaOne got into full swing. You could barely move sometimes, and there were a bunch of people whose job was to corral the crowds into some semblance of order.

As a Sun employee, I was on a restricted badge, which made it hard to get into sessions (you are basically flying standby). On the other hand, I had plenty to do. I participated in a dynamic languages panel for press and analysts (who have their own track), which was pretty fun. The discussion was lively enough that we could have gone for another hour. There was one persistent fellow who really wanted there to be just one language, or wanted us to declare language X better for task Y. When I got started in computing, people learned and worked in several languages. It’s only been recently that a language (Java) was popular enough that people could just learn one language, and the growth of web applications pretty much guarantees a multi-language future because of server side and client side differences. In the end, we’re back to finding and using the best tool for the job, or at least the most comfortable tool for the job. This is probably going to cause heartburn for big IT shops, but developers seem to be happy about it.
I took a walk through the Java Pavilion with Tim Bray one afternoon. He got into the AMD booth’s aromatherapy display (and yes, he has a similar shot of me doing the same thing). One of the highlights of that excursion was Tim introducing me to Dan Ingalls, who made a number of very substantial contributions to Smalltalk, including its original VM and the BitBlt graphics operation. I am a great admirer of the work that was done in Smalltalk, and it was an honor to meet Dan and have him explain the Lively Kernel to me. A short (and probably not quite fair) description of the Lively Kernel is to take the lessons learned from Smalltalk/Squeak and implement them in the browser using Javascript, AJAX, and SVG.
Unsurprisingly, I got the most value at JavaOne from the networking. And that means dinners, hallway conversations, and yes, the parties. Usually when I go to conferences, I am just a party attender. This time, I also worked at some of the parties. It was a little different to walk around the SDN party wearing a t-shirt with “SDN Event Staff” painted large on the back. I still had a good time. Between the T-shirt and the camera, I definitely had some good conversations.
Another benefit of being at a huge company is that they can really throw a big party. Like hiring Smash Mouth to play for a private concert:
I’ve uploaded the rest of my photos from the conference to this Flickr set.
I actually do have some technical commentary, but I am going to put that into another post.
Jesse Noller
The damning of the OLPC: Sic Transit Gloria Laptopi
Ivan Krstić just posted an entry named "Sic Transit Gloria Laptopi".
Without commenting on the problems of the OLPC project, which for some time has seemed to be rapidly pushing itself into oblivion, I completely agree with Ivan's points on open source and, frankly, with everything else he says.
It's saddening that the project which thrilled me, due to the ideas outlined by Ivan, now disgusts me and so many others.
Mark Ramm-Christensen
Threads, Processes, Rails, TurboGears, and Scalability
Threads may not be the best way, or the only way, to scale out your code. Multi-process solutions seem more and more attractive to me.
Unfortunately multi-process and the JVM are currently two tastes that don’t taste great together. You can do it, but it’s not the kind of thing you want to do too much. So, the JRuby guys had a problem — Rails’ scalability story is only multi-process (Rails core is NOT thread safe), and Java’s not so good at that.
Solution: Running “multiple isolated execution environments” in a single java process.
I think that’s a neat hack. The JRuby team is to be congratulated for making this work. It lets Rails mix multi-process concurrency with multi-threaded concurrency, if only on the JVM. But it’s likely to incur some memory bloat, so it’s probably not as good as it would be if Rails itself were to become threadsafe.

I’m not sure that the Jython folks have done anything like this. And I’m not sure they should: it solves a problem Python folks don’t really have. Django used to have some thread-safety issues, but those have been worked out on some level. While the Django people aren’t promising anything about thread safety, it seems that there are enough people using it in a multi-threaded environment to notice if anything’s not working right.
At the same time, TurboGears has been threadsafe from the beginning, as have Pylons, Zope, and many other Python web dev tools. The point is, you have good web-framework options without resorting to multiple Python environments in one JVM.
Why you actually want multi-threaded execution…
In TurboGears we’ve found that the combination of both multi-threaded and multi-process concurrency works significantly better than either one would alone. This allows us to use threads to maximize the throughput of one process up to the point where python’s interpreter lock becomes the bottleneck, and use multi-processing to scale beyond that point, and to provide additional system redundancy.
A multi-threaded system is particularly important for people who use Windows, which makes multi-process computing much more memory intensive than it needs to be. As my Grandma always said, Windows “can’t fork worth a damn.” ;)
But, given how hard multi-threaded computing can be to get right, TurboGears and related projects work hard to keep our threads isolated and not manipulate any shared resources across threads. So, really it’s kinda like shared-memory optimized micro-processes running inside larger OS-level processes, and that makes multi-threaded applications a lot more reasonable to wrap your brain around. Once you start down the path of lock management, the non-deterministic character of the system can quickly overwhelm your brain.
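To make that concrete, here’s a toy sketch of the model in Python 2 of the era (this is not TurboGears internals, just the shape of the idea): a couple of forked worker processes, each running a small pool of threads, with every thread keeping its mutable state to itself.

    import os, threading, Queue

    def handle(request):
        # All mutable state stays local to the call -- the
        # "micro-process" style described above, so no locks needed.
        return 'handled %s in pid %s' % (request, os.getpid())

    def serve(inbox):
        while True:
            request = inbox.get()
            if request is None:   # poison pill: time to shut down
                return
            print(handle(request))

    def run_process(n_threads=4, n_requests=20):
        # The queue is the only shared object, and it does its own locking.
        inbox = Queue.Queue()
        threads = [threading.Thread(target=serve, args=(inbox,))
                   for _ in range(n_threads)]
        for t in threads:
            t.start()
        for i in range(n_requests):
            inbox.put(i)
        for t in threads:
            inbox.put(None)
        for t in threads:
            t.join()

    if __name__ == '__main__':
        # Threads scale one process up to the GIL ceiling; beyond that,
        # fork more processes (on platforms that can fork, anyway).
        for _ in range(2):
            if os.fork() == 0:
                run_process()
                os._exit(0)
        os.wait()
        os.wait()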
As far as I can see, the same would be true for a Ruby web server in Ruby 1.9, where there is both OS-level thread support and an interpreter lock.
I’m well aware of the fact that Stackless, Twisted, and Nginx have proved that there are other (asynchronous) methods that can easily outperform the multi-threaded+multi-process model in throughput/concurrency per unit of server hardware. The async model requires thinking about the problem space pretty differently, so it’s not a drop-in replacement, but for some problems async is definitely the way to go.
Anyway, hats off to the Jruby team, and here’s hoping that Rails itself becomes threadsafe at some point in the future.
Philip Jenvey
Jython @ JavaOne 2008
I attended a couple of days of JavaOne last week and luckily avoided the norovirus outbreak. However, I did notice something else spreading through Moscone Center: interest in languages on the JVM. For example:
- The number of sessions/BoFs covering Dynamic Language topics: around 8 on Groovy, 5 on JRuby, 2 on Jython and also a couple on Scala and Rhino each, with some very good turnouts. That's not even including CommunityOne. Also featured was a Scripting language bowl: a faceoff between JRuby, Groovy, Scala and Jython.
- The JavaOne book store was stocked with just about every JRuby and Groovy book out there. Of the top 10 selling books at JavaOne, 3 were about languages on top of the JVM (Groovy and JavaFX). Wednesday's top 10 sellers also included JRuby committer Ola Bini's Practical JRuby on Rails as well as another Rails book. I'm hoping to see a whole lot of Python books on the shelf next year.
- A number of Ruby/JRuby folks (and not just the ones employed by Sun either) told me that they are quite happy with NetBeans' support for Ruby. It has code completion and even some refactoring support. Tor Norbye and the NetBeans crew are now working on JavaScript support (Look Mom, JavaScript type inference) and are slated to add Python support next. Ted Leung has already begun talking to them about the details.
- Ruby is still on the rise, and JRuby is a big contributing factor. JRuby definitely had the attention of many attendees and definitely has a growing userbase in Java land.
- Groovy and the Groovy on Grails combo is also on the rise. linkedin.com, Sky TV and SAP's Composition on Grails Product are a few notable users of Grails. IBM's Project Zero (aka the WebSphere sMash product) is also utilizing Groovy, as well as their own PHP on the JVM implementation.
- As for a language he'd use *now* on top of the JVM, other than Java, James Gosling endorses Scala.
- Sun's own new scripting language on top of the JVM, JavaFX, was of course all over the place.
- John Rose: "I think we are on the right track here, letting the JVM grow independently of the Java language. (The language, if it has room to grow, will catch up.) James Gosling expressed a similar sentiment at his February Java Users Group talk “The Feel of Java, Revisited”, when he said he sometimes felt more interested in the future of the JVM than that of the Java language. “I don’t really care about the Java language. All the magic is in the JVM specification.” (Yes, I think that is hyperbolic. No, he is not abandoning Java.) I love both Java and the JVM, and I am pushing on the JVM this year."
- The potential Java 7 new features. They're still potential because the Java 7 JSR isn't out yet. While some are nice (like John Rose's JSR 292, which will be a big help for JVM languages like Jython), some are arguably not nice. I hear more disdain than ever over where Java the language is going, which makes other languages on top of the JVM even more desirable.
- I met many new people (more than I expected) who have used Jython at some point in their career, and some who are using it now (like pushToTest's Frank Cohen). They're all eager about its recent progress and can't wait for the 2.5 release.
- Reminder: the Java world is large. Over 10,000 attendees, probably 10 times as many as this year's PyCon. All potential Python converts, right?
May 13, 2008
Doug Napoleone
The Hague Declaration
Andy Updegrove has just posted about The Hague Declaration. I received a phone call about it this morning, and I believe it is one of the most important declarations on human rights to come along in quite some time. Please go read up on this. It may at first appear that technology, and the standards those technologies are based on, are a very meta-level aspect of human rights, as opposed to the men in the night. Recent issues involving Google, Yahoo, Cuba, and South Africa have shown us otherwise. Please read Andy’s post in toto.
When one thinks of international human rights, one thinks of The Hague - home of the International Court of Justice and the International Criminal Court, and the situs of an increasing number of Tribunals chartered to redress the assaults on human dignity that inexcusably continue to plague this planet. It is therefore appropriate that The Hague has been chosen to witness yet another pronouncement in defense of human rights. That pronouncement has been titled The Hague Declaration by the new international group, called the Digital Standards Organization (“Digistan,” for short), that crafted it. In this blog entry, I’ll talk about what the Declaration is all about, and what it is intended to achieve.
The basic premise is that as more and more of our basic freedoms (speech, assembly, interaction with government, and so on) move from the real to the virtual world, care must be taken to ensure that our ability to exercise these freedoms is not inadvertently eroded or lost. And on the opportunity side, the Internet and the Web provide incredible and unique ways to bring the benefits heretofore enjoyed only in developed countries to those struggling for equality of opportunity in emerging countries.
– Andy Updegrove (Consortium Info Blog) ‘Introducing The Hague Declaration‘
Simon Wittber
I have Unity.
I've just arranged to purchase Unity 3D, a rapid game development tool.
I was first attracted to Unity because it could use the Boo language, which has some similarities to Python, my language of choice. Having played with the trial version, I've realized that Unity is an excellent tool, and will let me focus on games, instead of game libraries and frameworks.
Ironically, my first Unity project is not a game. I'll be using it to simulate industrial robots. Hopefully my client will allow me to post some screen shots as work progresses...
Jesse Noller
What are your favorite nose plugins? How do you run Nose?
So, I am pondering going all-out with Nose, and I am wondering what plugins people find the most useful for it, and also how people are using it.
I see two aspects of nose/any test execution mechanism: unit testing "native" code (i.e., Python code), and running tests that are more functional in nature (i.e., not testing Python itself, but instead testing something like a web interface).
What are the features of nose you found the most useful?
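For reference, writing a nose plugin is pretty lightweight. Here's a minimal sketch (the plugin and its name are made up for illustration; you'd register it via a setuptools entry point so nosetests can find it):

    import time
    from nose.plugins import Plugin

    class TimingPlugin(Plugin):
        # Subclassing Plugin and setting a name gives you a
        # --with-timing command line switch for free.
        name = 'timing'

        def startTest(self, test):
            self._start = time.time()

        def stopTest(self, test):
            print('%s took %.3fs' % (test, time.time() - self._start))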
Too bad I couldn't find a decent nose picking graphic for this one.
Ned Batchelder's blog
Blu's Muto
Blu is a street artist from Argentina. He's taken graffiti to a whole new level, creating animations on walls and sidewalks. His latest is Muto, which is both a technical tour de force and an eye-opening, creepy animation.
Not only did he work in the less than ideal environment of the sidewalk, but it meant that he couldn't have more than one frame in existence at a time, with no possibility of reworking old frames or sketching out new ones. Once the frame was shot, the work was destroyed. Amazing.
Sean McGrath
Coping with RSI - a field report
- Debugging, as any software engineer knows, is a tough problem owing to the difficulty of establishing repeatable causal connections between events. Debugging RSI is about the most complex problem I have ever tried to debug, I think. -- Coping with RSI - a field report
Enthought
Greg Wilson speaking at the Austin Python User Group meeting
Wednesday, May 14th, Greg Wilson will be joining us in Austin for the monthly APUG meeting. He’ll be talking about Beautiful Code. If you’re in the area, swing by Enthought’s Offices right downtown at the corner of 6th and Congress (the Epicenter for Weirdness, as we like to call it). There’s more information at the python.org [...]
Richard Jones' Log: Python
Bruce: how to proceed?
As previously hinted Bruce, the Presentation Tool may now display presentations authored in ReStructuredText. I've had a chance to do some more work on it lately, and need to take a step back and think about what I'm trying to do :)
The ReST capabilities currently available are:
- Sections denote pages (just like all the other ReST presentation tools),
- Lists are handled (some features missing).
- A lot of inline markup is handled.
- Images are handled, both inline and stand-alone.
- The stylesheet and other configuration may be changed on the fly with ".. config::" directives.
- The background decoration* may be specified with a ".. decoration::" directive (see the sketch after this list).
*: the decoration stuff is new as of very recently too. Currently it controls the background colour, but also allows rendering of quads (with colour gradient if you like) and images in the background. There's still much to do like scaling the decoration layer to the screen size, and adding more toys to decorate with like lines and possibly splines. Not sure how far to take it.
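To make that concrete, here's a rough sketch of what a Bruce ReST source using those directives might look like. The directive names come from the list above, but the option syntax is guesswork on my part, not documented behaviour:

.. this is a guess at the markup, not a verbatim Bruce example

First Page
==========

- a point
- another point, with *emphasis*

.. config::
   :style: big-centered

Second Page
===========

.. decoration::
   :background: #000040

.. image:: logo.png

Each section becomes a page, and the directives adjust the style and background decoration from that point on.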
Missing from the ReST side though is:
- Pages without titles (this would require some sort of "page" directive to indicate a new page has begun).
- Other page types like the Python interpreter, Python code and Video.
- Handling notes and running-sheet HTML generation sensibly.
- Allowing custom page types, perhaps through ".. custom:: <module name>"
Those things aren't insurmountable. I'm becoming increasingly convinced that ReST is a better way to go than the custom markup format, but I'm having trouble with the final decision to give up on the old format.
Of course maintaining two parsers is ... silly.
I'm pretty sure I've made my decision, but thought I'd throw this post out anyway in case anyone had any thoughts or encouragement...
Brian Jones
A Couple of MySQL Performance Tips
If you're an advanced MySQL person, you might already know these, in which case, please read anyway, because I still have some questions. On the other hand, if you're someone who launched an application without a lot of database background, thinking "MySQL Just Works", you'll eventually figure out that it doesn't, and in that case, maybe these tips will be of some use. Note that I'm speaking specifically about InnoDB and MyISAM, since this is where most of my experience is. Feel free to add more to this content in the comment area.
InnoDB vs. MyISAM
Which one to use really depends on the application, how you're deploying MySQL, your plans for growth, and several other things. The very high-level general rule you'll see touted on the internet is "lots of reads, use MyISAM; lots of writes, use InnoDB", but this is really an oversimplification. Know your application, and know your data. If all of your writes are *inserts* (as opposed to updates or deletes), MyISAM allows for concurrent inserts, so if you're already using MyISAM and 90% of your writes are inserts, it's not necessarily true that InnoDB will be a big win, even if those inserts make up 50% of the database activity.
In reality, even knowing your application and your data isn't enough. You also need to know your system, and how MySQL (and its various engines) uses your system's resources. If you're using MyISAM and you're starting to be squeezed for disk space, I would not recommend moving to InnoDB: InnoDB will tend to take up more space on disk for the same database. If you're squeezed for RAM, I would also not move to InnoDB, because, while clustered indexes are a big win for a lot of application scenarios, they cause the data to be stored along with the index, so the same data takes up more space in RAM when it is being cached.
In short, there are a lot of things to consider before making the final decision. Don’t look to benchmarks for much in the way of help — they’re performed in “lab” environments and do not necessarily model the real world, and almost certainly aren’t modeled after your specific application. That said, reading about benchmarks and what might cause one engine to perform better than another given a certain set of circumstances is a great way to learn, in a generic sort of way, about the engines.
Indexing
Indexes are strongly tied to performance. The wrong indexing strategy can cause straight selects on tables with relatively few rows to take an inordinately long time to complete. The right indexing strategy can help you keep your application 'up to speed' even as data grows. But there's a lot more to the story, and blind navigation through the maze of indexing options is likely to result in poorer performance, not better. For example, indexing all of the columns in a table in various schemes all at once is likely to hurt overall performance, but at the same time, depending on the application's needs, the size of the table, and the operations that need to be performed on it, there could be an argument for doing just that!

You should know that indexes (at least in MySQL) come in two main flavors: clustered and non-clustered (there are other attributes, like 'hashed', that can be applied to indexes, but let's keep it simple for now). MyISAM uses non-clustered indexes. This can be good or bad depending on your needs. InnoDB uses clustered indexes, which can also be good or bad depending on your needs.
Non-clustered indexing generally means that the index consists of a key and a pointer to the data the key represents. I said "generally" because I don't know the really low-level details of how MySQL deals with its non-clustered indexes, but everything I've read leads me to believe it's not much different from Sybase and MSSQL, which do essentially the same thing. The result of this setup is that a query based on an index is still a two-step operation for the database engine: it has to scan the index for matching key values, and then follow each pointer to get at the data the key represents. If that data is being read from disk (as opposed to memory), the disk seeks fall into the category of "random I/O": even though the index values are stored in order, the data on disk probably is not. The disk head has to go running around like a chicken without a head trying to grab all of the data.
Clustered indexes, by comparison, kinda rock. Different products do it differently, but the general idea is that the index and the data are stored together, and in order. The good news here is that all of the random I/O you had to go through for sequential ranges of index values goes away, because the data is right there, in the order dictated by the index. Another win, which can be really dramatic in my experience, comes with an index-covered query (a query that can be completely satisfied by data in the index). This results in virtually no I/O, and extremely fast queries, even on tables with a million rows or more. The price you pay for this benefit, though, can be large, depending on your system configuration: in order to keep all of that data together in the index, more memory is required. Since InnoDB uses clustered indexes and MyISAM doesn't, this is what most people cite as the reason for InnoDB's larger memory footprint. In my experience, I don't see anything else to attribute it to. Thoughts welcome.
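If you want to check whether a given query is index-covered, EXPLAIN will tell you: 'Using index' in the Extra column means the query was satisfied from the index alone. A quick sketch using the MySQLdb driver (the connection details, table, and columns are invented for illustration):

import MySQLdb

# hypothetical connection and schema, purely for illustration
conn = MySQLdb.connect(host='localhost', user='app', passwd='secret', db='photos')
cur = conn.cursor()
cur.execute("EXPLAIN SELECT user_id, photo_id"
            " FROM photo_map WHERE user_id = 1000")
for row in cur.fetchall():
    print row  # look for 'Using index' in the Extra column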
Indexes can be tricky, and to some they look like a black art. While I'm a fan of proper data schema design, and of the notion that data wants to be organized independently of the application(s) it serves, I think that once you get to indexing, it is imperative to understand how the application(s) use the data and interact with the database. There isn't some generic set of rules for indexing that will result in good performance regardless of the application. On the other hand, unlike schema design, you don't have data integrity issues to concern yourself with when developing an index strategy, so you're free to tune for the workload. One question that arises often enough to warrant further discussion is "hey, this column is indexed, and I'm querying on that column, so why isn't the index being used?"
The answer is diversity. If you're running one of those crazy high-performance web 2.0 behemoth web sites, one thing you've no doubt tossed around is the idea of sharding your data. This means that, instead of having a table with 400,000,000 rows on one server, you break up that data along some logical demarcation point to make it smaller, so it can be more easily spread across multiple servers. In doing so, you might create 100 tables with 4,000,000 rows apiece. However, a common problem in figuring out how to shard the data is "hot spots". For example, if you run Flickr, and your 400,000,000-row table maps user IDs to the locations of their photos, and you break up the data by user ID (maybe a "user_1000-2000" table for users with IDs between 1000 and 2000), then your tables can contain far less diverse data than you had before, which can potentially cause *worse* performance than you had before. I've tested this lots of times, and found that MySQL tends to make the right call in these cases. Perhaps it's a bit counterintuitive, but if you test it, you'll find the same thing.
For example, say that user 1000 has 400,000 photos (and therefore, 400,000 rows in the user_1000-2000 table), and the entire table contains a total of 1,000,000 rows. That means that user 1000 makes up 40% of the rows in the table. What should MySQL do? Should it perform 400,000 multi-step “find the index value, get the pointer to the data, go get the data” operations, or should it just perform a single pass over the whole table? At some point there must be a threshold at which performing a table scan becomes more efficient than using the index, and the MyISAM engine seems to set this threshold at around 30-35%. This doesn’t mean you made a huge mistake sharding your data — it just means you can’t assume that a simple index on ‘userID’ that worked in the larger table is going to suffice in the smaller one.
But what if there just isn’t much diversity to be had? Well, perhaps clustered indexing can help you, then. If you switch engines to InnoDB, it’ll use a clustered index for the primary key index, and depending on what that index consists of, and how that matches up with your queries, you may find a solution there. What I’ve found in my testing is that, presumably due to the fact that data is stored, in order, along with the index, the “table scan” threshold is much higher, because the number of IO operations MySQL has to perform to get at the actual data is lower. If you have index-covered queries that are covered by the primary key index, they should be blazing fast, where in MyISAM you’d be doing a table scan and lots of random I/O.
For the record, and I'm still investigating why this is, I've also personally found that InnoDB's secondary indexes seem to be faster than those in MyISAM, though I don't believe there's much in the way of an advertised reason why this might be. Input?
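Back to the sharding example for a moment: the range-based naming scheme described above is easy to compute in application code. A toy helper (the naming convention and bucket size are just the ones from the Flickr-style example above, not a recommendation):

def shard_table(user_id, bucket_size=1000):
    # map a user ID to its range-named shard table, e.g. 1500 -> 'user_1000-2000'
    low = (user_id // bucket_size) * bucket_size
    return 'user_%d-%d' % (low, low + bucket_size)

print shard_table(1500)  # prints: user_1000-2000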
Joins, and Denormalization
For some time, I read about how sites like LiveJournal, Flickr, and lots of other sites dealt with scaling MySQL with my head turned sideways. “Denormalize?! Why would you do that?!” Sure enough, though, the call from on high at all of the conferences by all of the speakers seemed to be to denormalize your data to dramatically improve performance. This completely baffled me.
Then I learned how MySQL does joins. There's no magic. There's no crazy hashing scheme or merging sequence going on here. It is, as I understand it (I haven't read the source), a nested loop. After learning this, and twisting some of my own data around and performing literally hundreds, if not thousands, of test queries (I really enjoy devising test queries), I cringed, recoiled, popped up and down from the ceiling to the floor a couple of times like that Jekyll and Hyde cartoon, and started (carefully, very carefully) denormalizing the data.
I cannot stress how carefully this needs to be done. It may not be completely obvious which data should be denormalized/duplicated/whatever. Take your time. There are countless references for how to normalize your data, but not a single one that'll tell you "the right way" to denormalize, because denormalization itself is not considered by any database theorist to be "right". Ever. In fact, I have read some great theorists, and they will admit that, in practice, there is room for a "lack of normalization", but they just mean that if you only normalize to 3NF (3rd Normal Form), that suits many applications' needs. They do *NOT* mean "it's ok to take a decently normalized database and denormalize it". To them, normalization is a one-way street: you get more normalized, never less. These theorists typically do not run highly scalable web sites. They seem to talk mostly in the context of reporting on internal departmental data sets with predictable and relatively slow growth rates, and relatively small amounts of data. They do not talk about 10GB tables containing tens or hundreds of millions of rows, growing at a rate of 300,000-500,000 rows per day. For that, there is only anecdotal evidence that solutions work, and tribal war stories about what doesn't.
My advice? If you cannot prove that removing a join results in a dramatic improvement in performance, I'd rather perform the join if it means my data stays relatively normalized. Denormalization may appear to be something that "the big boys" at those fancy startups are doing, but keep in mind that they're doing lots of stuff they'd rather not do, and probably wouldn't do if they had the option (and if MySQL didn't have O(n^2)- or O(n^3)-like performance with regard to joins).
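To make the nested-loop point concrete, here's the shape of what the join engine is doing, reduced to toy Python (the tables and rows are invented):

users = [(1, 'alice'), (2, 'bob')]
photos = [(1, 'a.jpg'), (1, 'b.jpg'), (2, 'c.jpg')]

# a nested-loop join on users.id = photos.owner_id
for user_id, name in users:            # outer table
    for owner_id, filename in photos:  # inner table, rescanned per outer row
        if owner_id == user_id:
            print name, filename

Without an index on the join column, the inner scan repeats in full for every outer row, which is where the painful multiplicative cost comes from.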
Do You Have an IO Bottleneck?
This is usually pretty easy to determine if you're on a UNIX-like system. Most UNIX-like systems come with an 'iostat' command, or have one readily available. Different UNIX variants show different 'iostat' output, but the basic data is the same, and the number you're looking for is "iowait" or "%iowait". On Linux systems, you can run 'iostat -cx 2' and that'll print out, every 2 seconds, the numbers you're looking for. Basically, %iowait is the percentage of time (over the course of the last 2-second interval) that the CPU had to hang around waiting for I/O to complete so it would have data to work with. Get a read of what this number looks like when there's nothing special going on. Then take a look at it on a moderately loaded server. Use these numbers to gauge when you might have a problem. For example, if %iowait never gets above 5% on a moderately loaded server, then 25% might raise an eyebrow. I don't personally like when those numbers go into double digits, but I've seen %iowait on a heavily loaded server get as high as 98%!
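If you'd rather watch this programmatically, a rough sketch along these lines works on Linux (iostat's column layout varies between versions and platforms, so the parsing here is illustrative, not robust):

import os

# grab one CPU report from iostat and pull out %iowait
lines = os.popen('iostat -c').readlines()
for i, line in enumerate(lines):
    if '%iowait' in line:
        # the header row has a leading 'avg-cpu:' label that the data
        # row lacks, hence the - 1 when lining up columns
        col = line.split().index('%iowait') - 1
        iowait = float(lines[i + 1].split()[col])
        if iowait > 10.0:
            print 'warning: %%iowait is at %.1f%%' % iowait
        break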
Ok, time for bed
I find database-related things to be really a lot of fun. Developing interesting queries that do interesting things with data is to me what crossword puzzles are to some people: a fun brain exercise, often done with coffee. Performance tuning at the query level, database server level, and even OS level satisfies my need to occasionally get into the nitty-gritty details of how things work. I kept this information purposely kind of vague to focus on high-level concepts with little interruption for drawn-out examples, but if you're reading this and have good examples that support or refute anything here, I'm certainly not above being wrong, so please do leave your comments below!
Ned Batchelder's blog
Boy vs. girl
Here's a short but complicated question: if grown men object to being called "boy", why don't grown women object to being called "girl"?
I was raised in New York City in the 1970's by a radical lesbian feminist, and to my ears, "girl" is completely wrong. I'm always a little thrown by hearing adults referred to as "girl". It seems demeaning, but plenty of women refer to themselves that way. Am I completely out of touch? Are they?
Greg Wilson
Aaaand They’re Off!
Our summer interns started this morning—we got Summer of Code, we got NSERC USRA, we got ITCDF, we got you name it, a lab and a half’s worth. I gave the least coherent welcoming speech of my life (bad cold, little sleep), our trusty sys admin Alan helped ‘em get their accounts set up, and whoosh, off they went.
And speaking of "off they went": we have been very lucky these past few months to have David Wolever working full-time on DrProject. He and his fiancée are headed to Brazil this summer to do some digital inclusivity work, but they'll be back in the fall. Safe journey; we look forward to hearing about your adventures.
- Cytoscape Graph Layout: Victoria Mui (GSoC)
- Documentation/Testing: Luke Petrolekas (GVW)
- Dojo Form Editor: Jeff Balogh (GSoC)
- DrProject:
- Admin: Qiyu Zhu (GSoC)
- Chat: Kosta Zabashta (USRA)
- Miscellaneous: Nick Jamil (GVW)
- Eclipse Feature Diagram Plugin: Nicole Allard (CSC494)
- Web-CAT:
- Eclipse Plugin: Geofrey Flores (USRA), Qi Yang (GSoC)
- Python Back End: Eran Henig (GSoC)
- Flare Dataflow Editor: Ming Chow (CSC494/495), Wenbing Li (CSC494/495)
- Hackystat:
- Data Visualization: Eva Wong (GSoC)
- Visual Studio Plugin: Matthew Basset (GSoC)
- OS161 Visualization: Xuan Le (ITCDF), Edward Robinson (CSC495)
- OpenAFS Control Panel: Joseph Yeung (OpenAFS)
- PyGraphics: Chris Maddison (ITCDF), Hardeep Singh (ITCDF)
- SlashID: Dmitri Vassiliev (CSC494/495)
May 12, 2008
Beginning Python for Bioinformatics
Obtaining overrepresented motifs in DNA sequences, part 4
We found a way to make the Python script as good as or better than the C++ executable. But for the analysis we need to do, motif counts are not the value we want. We need the quorum: the number of sequences in which the motif is present at least once. For instance, if the desired motif was AAACCCTTTG, we would check in which sequences this word is present. Let's say that in a cluster of 10 sequences we find it in sequences 1, 2, 3, 4 and 5, giving us a quorum of 5 out of 10, or 50%. The quorum will be used later in the statistical calculation to determine the overrepresented motifs.
With only a couple of modifications, we can adapt the script used to get the motif counts to get the quorum.
#!/scratch/python/bin/python
from collections import defaultdict
import sys

import fasta

# sequences from the FASTA file named on the command line, and the motif length
seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])

# maps each motif to the list of sequence numbers it appears in
quorum = defaultdict(list)
seq_number = 0
for i in seqs:
    seq_number += 1
    # slide a window of the given length along the sequence; the + 1
    # keeps the final window from being skipped
    for n in range(len(i.sequence) - length + 1):
        word = i.sequence[n:n + length]
        if seq_number not in quorum[word]:
            quorum[word].append(seq_number)

for word in quorum:
    print word.upper(), len(quorum[word])
Basically, we change the way the defaultdict is initialized, this time with list instead of int, and we also change the procedure that used to get the counts. The loop does identical work, iterating along the sequences with a window (of the input length) sliding over them and checking each word. This time, instead of incrementing the value in the defaultdict, we append the sequence number (obtained from an integer index variable incremented in each iteration of the loop) to the list, if that number is not already in the list. In the end, each value of quorum is a list of numbers, and by printing the list's length we obtain the quorum. Testing the above script, there is no performance loss compared with the previous count script.
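As an aside that isn't part of the original script: since each list is only used for membership tests and a final count, a set would do the same job without the linear "is it already there?" scan. The loop would shrink to something like:

quorum = defaultdict(set)
seq_number = 0
for i in seqs:
    seq_number += 1
    for n in range(len(i.sequence) - length + 1):
        # set.add() ignores duplicates, so no membership test is needed
        quorum[i.sequence[n:n + length]].add(seq_number)

for word in quorum:
    print word.upper(), len(quorum[word])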
Next we will see which statistical method to use and start devising a script to calculate it.
- Subscriptions
- [OPML feed]
- Aahz's Weblog
- Abe Fettig
- Aftermarket Pipes
- Alessandro Iob
- Andre Roberge
- Andrew Bennetts
- Andrew Dalke
- Andrew R. Gross
- Andy Dustman
- Andy Todd
- Anthony Baxter
- Arthur Koziel
- Baiju M.
- Base-Art / Articles
- Beginning Python for Bioinformatics
- Ben Bangert
- Benji York
- Blue Sky On Mars (Python)
- Brandon Rhodes
- Brett Cannon
- Brian Jones
- Calvin Spealman
- Carlos de la Guardia
- Chad Whitacre
- Chris McAvoy
- Chris McDonough
- Christopher Lenz
- Chui Tey
- Corey Goldberg
- Cosmic Seriosity Balance
- Daniel Nouri
- Darryl VanDorp
- David Ascher
- David Goodger
- David Stanek
- Deadly Bloody Serious about Python
- Dethe Elza
- Doug Hellmann
- Doug Napoleone
- Enthought
- EuroPython Conference
- Feet up! : dev/python
- Flavio Coelho
- Floris Bruynooghe
- Frank Wierzbicki
- Fredrik Lundh
- Gary Poster
- Georg Brandl
- Glenn Franxman
- Glyph Lefkowitz
- Graham Dumpleton
- Greg Wilson
- Grig Gheorghiu
- Guido van Rossum's Weblog
- Gustavo Niemeyer
- Guyon Moree
- Hans Nowak
- IPython0 blog
- Ian Bicking
- IronPython-URLs
- James Tauber
- Jason Diamond
- Jeff Rush
- Jeff Shell
- Jehiah Czebotar
- Jeremy Hylton
- Jesse Noller
- Jkx@home
- Johannes Woolard
- JotSite.com
- Julien Anguenot
- Juri Pakaste
- Krys Wilken
- Kumar McMillan
- Laurent Szyster
- Lennart Regebro
- Level++
- Life of Brian Ray - Python
- Marius Gedminas
- Mark Dufour
- Mark Nottingham
- Mark Ramm-Christensen
- Mathieu Fenniak's Weblog
- Matt Goodall
- Matt Harrison
- Matt Kaufman
- Matthew Wilson
- Max Ischenko' blog
- Max Khesin
- Michael Bayer
- Michael Hudson
- Michael J.T. O'Kelly
- Mike Pirnat
- Muharem Hrnjadovic
- Neal Norwitz
- Ned Batchelder's blog
- Nick Efford
- Patrick Roberts's Blog
- Paul Everitt
- Paul Harrison
- Peter Bengtsson
- Peter Hunt
- Phil Hassey
- Philip Jenvey
- Philip Lindsay
- Philipp von Weitershausen
- Phillip J. Eby
- PyAMF Blog
- PyCon
- PyCon 2007 Podcast
- PyCon 2008 Podcast
- PyCon 2008 on YouTube
- PyPy Development
- Python 411 Podcast
- Python Advocacy
- Python Magazine
- Python News
- Python Postings
- Python Secret Weblog
- Python Software Foundation
- Python User Groups
- Pythonology
- Rene Dudfield
- Richard Jones' Log: Python
- Robert Brewer
- Roberto De Almeida
- Robin Dunn
- Ryan Phillips
- SPE Weblog
- Sean McGrath
- Second p0st
- ShowMeDo
- Simon Belak
- Simon Wittber
- Small Values of Cool
- SnapLogic
- Speno's Pythonic Avocado
- Spyced
- Steve Holden
- Supervisor
- Swaroop C H, The Dreamer » Python
- Tarek Ziade
- Ted Leung
- Tennessee Leeuwenburg
- Tero Kuusela
- The Law Of Unintended Consequences
- The Python Papers
- The Voidspace Techie Blog
- Tim Golden
- Tim Parkin
- Titus Brown
- Troy Melhase
- V.S. Babu
- VirtualVitriol
- Will McGugan
- Will's blog
- it's getting better
- ivan krstić · code culture
- keyphrene.com
- markpasc.org weblog: python edition
- planet.python.org updates
- python-dev summaries
- To request addition or removal:
e-mail webmaster at python.org