Four short links: 4 June 2010
Arduino Home, DIY Lightning with Water and Gravity, Graph Visualization, and Neglected Diseases
by Nat Torkington | @gnat
- HomeSense -- an open user-centered research project investigating the use of smart and networked technologies in the home, with uber-Arduino-rockstar Alexandra Deschamps-Sonsino. (via titine on Twitter)
- Kelvin's Thunderstorm (Instructables) -- "create lightning from water and gravity". Simple and impressive science. (via Paul Fenwick)
- Graph Visualization Code in Javascript (Stack Overflow) -- good pointers to interesting libraries.
- ChEMBL - Neglected Tropical Disease archive -- a repository for Open Access primary screening and medicinal chemistry data directed at neglected diseases. CC0-licensed datasets identifying several tens of thousands of compounds active against the malarial parasite P. falciparum in an effort to lower the cost of drug creation for this neglected disease. (via Common Knowledge blog)
tags: arduino, data science, hardware hacking, javascript, math, open data, science, science commons, visualization
"Hackers" at 25
It's been 25 years since "Hackers" was published. Author Steven Levy reflects on the book and the movement.
by Mac Slocum | @macslocum
Steven Levy wrote a book in the mid-1980s that introduced the term "hacker" -- the positive connotation -- to a wide audience. In the ensuing 25 years, that word and its accompanying community have gone through tremendous change. The book itself became a mainstay in tech libraries.
O'Reilly recently released an updated 25th anniversary edition of "Hackers," so I checked in with Levy to discuss the book's development, its influence, and the role hackers continue to play.
Writing "Hackers"
Do you remember the original pitch for "Hackers"?
Steven Levy: I don't remember it, though I can tell you that it didn't wind up being what the book was. I thought I was going to embark on a series of magazine articles.
Soon after I started researching, it seemed like it was going to be a two-part book starting with the Homebrew Computer Club and then the game hackers and that emerging industry. But then I realized that the whole hacker culture started at MIT. That was where I had to go, and it turned out to be a key section of the book.
Of all the stories and profiles in the book, which resonated most with you?
SL: The MIT story was just amazing. I stumbled upon this important history that no one else had chronicled. It's difficult to overestimate how important that community was to hatching the culture of hacking, and really the culture of computing. Its ripples went far beyond the hacker community, out to the way we all use computers.
I got to learn about people like Richard Greenblatt and Bill Gosper, whom no one had ever heard of. The way they expressed themselves and the reverberations they created were very influential. They were legends within the walls of MIT.
tags: hackers, hacking, hacks
Connecting the dots with Intellipedia
U.S. intelligence agencies are using an internal wiki for knowledge sharing.
by Alex Howard | @digiphile
This April, Intellipedia celebrated its fourth anniversary. As the federal government considers building a new internal social network, "Fedspace," the lessons learned from Intellipedia are worth considering. Last week, I spoke with Don Burke and Sean Dennehy, two long-time CIA officers who have been both the public faces of Intellipedia and internal evangelists since its inception.
So is Intellipedia working? Read more, after the jump.
tags: gov 2.0, government, government 2.0, Intellipedia, wiki
How Facebook satisfied a need for speed
Facebook boosted speed 2x. Director of engineering Robert Johnson explains how.
by Mac Slocum | @macslocum
Remember how Facebook used to lumber and strain? And have you noticed how it doesn't feel slow anymore? That's because the engineering team pulled off an impressive feat: an in-depth optimization and rewrite project made the site twice as fast.
Robert Johnson, Facebook's director of engineering and a speaker at the upcoming Velocity and OSCON conferences, discusses that project and its accompanying lessons learned below. Johnson's insights have broad application -- you don't need hundreds of millions of users to reap the rewards.
Facebook recently overhauled its platform to improve performance. How long did that process take to complete?
Robert Johnson: Making the site faster isn't something we're ever really done with, but we did make a big push the second half of last year. It took about a month of planning and six months of work to make the site twice as fast.
tags: facebook, optimization, speed
Four short links: 3 June 2010
Passionate Users, Mail APIs, Phone Hacking, and Patent Data Online
by Nat Torkington | @gnat
- How to Get Customers Who Love You Even When You Screw Up -- a fantastic reminder of the power of Kathy Sierra's "I Rock" moments. In that moment I understood Tom's motivation: Tom was a hero. (via Hacker News)
- Yahoo! Mail is Open for Development -- you can write apps that sit in Yahoo! Mail, using and extending the UI as well as taking advantage of APIs that access and alter the email.
- Canon Hack Development Kit -- hack a PowerShot to be controlled by scripts. (via Jon Udell)
- 10TB of US PTO Data (Google Books) -- the PTO has entered into a two-year deal with Google to distribute patent and trademark data for free. At the moment it's 10TB of images and full text of grants, applications, classifications, and more, but it will grow over time: in the future we will be making more data available including file histories and related data. (via Google Public Policy blog post)
tags: apis, Creating Passionate Users, Google Books, hacks, mail, open data, patents, photography, startups, yahoo
Velocity Culture: Web Operations, DevOps, etc...
by Jesse Robbins | @jesserobbins
Velocity 2010 is happening on June 22-24 (right around the corner!). This year we've added a third track, Velocity Culture, dedicated to exploring what we've learned about how great teams and organizations work together to succeed at scale.
Web Operations, or WebOps, is what many of us have been calling these ideas for years. Recently the term "DevOps" has become a kind of rallying cry that is resonating with many, along with variations like Agile Operations. No matter what you call it, our experiences over the past decade have taught us that culture matters more than any tool or technology in building, adapting, and scaling the web.
Here is a small sample of the upcoming Velocity Culture sessions:
Ops Meta-Metrics: The Currency You Use to Pay For Change
Presenter: John Allspaw (Etsy.com)
Change to production environments can cause a good deal of stress and strain amongst development and operations teams. More and more organizations are seeing benefits from deploying small code changes more frequently, for stability and productivity reasons. But how can you figure out how much change is appropriate for your application or your culture?
A Day in the Life of Facebook Operations
Presenter: Tom Cook (Facebook)
Facebook's Technical Operations team has to balance the need for constant availability with a fast-moving and experimental engineering culture. We release code every day. Additionally, we are supporting exponential user growth while still managing an exceptionally high ratio of users per employee within engineering and operations.
tags: cloud, development, devops, operations, velocity10, velocity2010, velocityconf, web2.0, webops
What is data science?
Analysis: The future belongs to the companies and people that turn data into products.
by Mike Loukides | @mikeloukides
We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in "What Is Web 2.0?", Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?
In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets.
Sections:
- What is data science?
- Where data comes from
- Working with data at scale
- Making data tell its story
- Data scientists
What is data science?
The web is full of "data-driven apps." Almost any e-commerce application is a data-driven application. There's a database behind a web front end, and middleware that talks to a number of other databases and data services (credit card processing companies, banks, and so on). But merely using data isn't really what we mean by "data science." A data application acquires its value from the data itself, and creates more data as a result. It's not just an application with data; it's a data product. Data science enables the creation of data products.
One of the earlier data products on the Web was the CDDB database. The developers of CDDB realized that any CD had a unique signature, based on the exact length (in samples) of each track on the CD. Gracenote built a database of track lengths, and coupled it to a database of album metadata (track titles, artists, album titles). If you've ever used iTunes to rip a CD, you've taken advantage of this database. Before it does anything else, iTunes reads the length of every track, sends it to CDDB, and gets back the track titles. If you have a CD that's not in the database (including a CD you've made yourself), you can create an entry for an unknown album. While this sounds simple enough, it's revolutionary: CDDB views music as data, not as audio, and creates new value in doing so. Their business is fundamentally different from selling music, sharing music, or analyzing musical tastes (though these can also be "data products"). CDDB arises entirely from viewing a musical problem as a data problem.
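The signature idea is easy to sketch. Below is a minimal Python version of a freedb/CDDB-style disc ID, which packs a checksum of the track start times, the playing time, and the track count into one 32-bit value. The track offsets are invented, and a real client deals with frame-level offsets and fuzzy matching, so treat this as an illustration of the concept rather than the exact algorithm iTunes uses.

```python
# A minimal sketch of a freedb/CDDB-style disc ID (illustrative, not the
# exact algorithm iTunes uses). Two pressings of the same album share the
# same track offsets, so they hash to the same 32-bit identifier.

def cddb_disc_id(track_offsets, disc_length):
    """track_offsets: start time of each track in seconds; disc_length in seconds."""
    def digit_sum(n):
        return sum(int(d) for d in str(n))

    checksum = sum(digit_sum(offset) for offset in track_offsets)
    playing_time = disc_length - track_offsets[0]
    # Pack checksum, playing time, and track count into a single 32-bit ID.
    return ((checksum % 255) << 24) | (playing_time << 8) | len(track_offsets)

# A hypothetical four-track disc: start times in seconds, 890 seconds total.
offsets = [2, 180, 415, 650]
print(f"{cddb_disc_id(offsets, 890):08x}")
```

The point isn't the arithmetic; it's that a purely physical artifact becomes a database key, and everything of value (titles, artists, albums) hangs off that key.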
Google is a master at creating data products. Here are a few examples:
- Google's breakthrough was realizing that a search engine could use input other than the text on the page. Google's PageRank algorithm was among the first to use data outside of the page itself -- in particular, the number of links pointing to a page. Tracking links made Google searches much more useful, and PageRank has been a key ingredient in the company's success. (A toy sketch of the algorithm follows this list.)
- Spell checking isn't a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate. The company has built a dictionary of common misspellings, their corrections, and the contexts in which they occur.
- Speech recognition has always been a hard problem, and it remains difficult. But Google has made huge strides by using the voice data it has collected, and it has been able to integrate voice search into its core search engine.
- During the Swine Flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.
[Figure: Google Flu Trends -- Google was able to spot trends in the Swine Flu epidemic roughly two weeks before the Centers for Disease Control by analyzing searches that people were making in different regions of the country.]
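To make the link-analysis idea concrete, here's a toy power-iteration PageRank in Python. The four-page graph and the damping factor are invented for the example, and Google's production system is of course vastly more elaborate; this is a sketch of the core insight that links carry ranking signal.

```python
# Toy power-iteration PageRank over a made-up link graph. The damping
# factor models a surfer who occasionally jumps to a random page, which
# keeps the iteration stable and well defined.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Even on four pages you can see the behavior: "c" collects rank because everything links to it, and "d", which nothing links to, ends up near the floor set by the damping term.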
Google isn't the only company that knows how to use data. Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with sometimes frightening accuracy. Amazon saves your searches, correlates what you search for with what other users search for, and uses it to create surprisingly appropriate recommendations. These recommendations are "data products" that help to drive Amazon's more traditional retail business. They come about because Amazon understands that a book isn't just a book, a camera isn't just a camera, and a customer isn't just a customer; customers generate a trail of "data exhaust" that can be mined and put to use, and a camera is a cloud of data that can be correlated with the customers' behavior, the data they leave every time they visit the site.
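A toy version of the "customers who bought this also bought" idea: count item co-occurrences across orders and recommend the most frequent companions. The orders below are invented, and real recommenders normalize for popularity and scale to millions of items; this sketch shows only the core correlation step.

```python
# Item-to-item co-occurrence counting: the simplest form of the
# "customers who bought X also bought Y" recommendation pattern.

from collections import Counter
from itertools import combinations

orders = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"novel", "bookmark"},
    {"camera", "tripod"},
]

co_counts = {}  # item -> Counter of items bought alongside it
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_counts.setdefault(a, Counter())[b] += 1
        co_counts.setdefault(b, Counter())[a] += 1

def recommend(item, n=3):
    return [other for other, _ in co_counts.get(item, Counter()).most_common(n)]

print(recommend("camera"))  # e.g. ['sd_card', 'tripod']
```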
The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That's the beginning of data science.
In the last few years, there has been an explosion in the amount of data that's available. Whether we're talking about web server logs, tweet streams, online transaction records, "citizen science," data from sensors, government data, or some other source, the problem isn't finding data, it's figuring out what to do with it. And it's not just companies using their own data, or the data contributed by their users. It's increasingly common to mashup data from a number of sources. "Data Mashups in R" analyzes mortgage foreclosures in Philadelphia County by taking a public report from the county sheriff's office, extracting addresses and using Yahoo to convert the addresses to latitude and longitude, then using the geographical data to place the foreclosures on a map (another data source), and group them by neighborhood, valuation, neighborhood per-capita income, and other socio-economic factors.
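Here's what that mashup pattern looks like in miniature, sketched in Python rather than R: parse foreclosure records, geocode each address, and roll the results up by neighborhood. The geocoder below is a stand-in lookup table (the book uses Yahoo's geocoding API) and the records are invented.

```python
# Miniature data mashup: records from one source, locations from another,
# aggregated by neighborhood. The geocode() function is a stand-in for a
# real web API call, and all the data here is made up.

from collections import defaultdict

foreclosures = [
    {"address": "100 Main St", "valuation": 85000},
    {"address": "200 Oak Ave", "valuation": 120000},
    {"address": "300 Main St", "valuation": 67000},
]

def geocode(address):
    """Stand-in for a real geocoding service lookup."""
    fake_geocoder = {
        "100 Main St": (39.95, -75.16, "Center City"),
        "200 Oak Ave": (40.00, -75.20, "Germantown"),
        "300 Main St": (39.96, -75.17, "Center City"),
    }
    return fake_geocoder.get(address)

by_neighborhood = defaultdict(list)
for record in foreclosures:
    located = geocode(record["address"])
    if located:
        lat, lon, neighborhood = located
        by_neighborhood[neighborhood].append(record["valuation"])

for hood, values in by_neighborhood.items():
    print(hood, "-", len(values), "foreclosures, mean valuation:", sum(values) / len(values))
```

Most of the work in a real mashup is exactly what this sketch glosses over: scraping the sheriff's report, cleaning malformed addresses, and handling geocoder misses.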
The question facing every company today, every startup, every non-profit, every project site that wants to attract a community, is how to use data effectively -- not just their own data, but all the data that's available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
To get a sense for what skills are required, let's look at the data lifecycle: where it comes from, how you use it, and where it goes.
tags: data, data science, data scientist
Making community health information as useful as weather data
Open health data from Health and Human Services is driving more than 20 new apps.
by Alex Howard | @digiphile
tags: gov 2.0, government 2.0, health 2.0, health information technology, HHS
Four short links: 2 June 2010
WikiLeaks Ethics, Education Business Opportunities, Corewar Updated, Watch Google IO
by Nat Torkington | @gnat
- Wikileaks Launched on Stolen Documents (Wired) -- Wired claims the first set of documents was obtained by running a Tor node that users connected to ("exit node") and saving the plaintext that was sent to the users, without their knowledge. Reminds me of the adage that nothing big in Silicon Valley starts without being some degree of evil first: YouTube turning a blind eye to copyright infringement, Facebook games and spam, etc.
- VC Investments in Education -- Cleantech investors are chasing a 3x larger market than Education and yet are putting 50-60x the money to work chasing those returns.
- Cells: A Massively Multi-Agent Python Programming Game -- a sweet-looking update on the old Core War game.
- Google IO 2010 Session Videos Online -- I'm keen to learn more about BigData and Prediction APIs, which seem to me an eminently sensible move by Google to play to their strengths.
tags: business, education, ethics, events, games, Google I/O, investments, open source, programming, python, vc, wikileaks
Four short links: 1 June 2010
Legal XML, Big Social Data, Crowdsourcing Tips, Copyright Balkanization
by Nat Torkington | @gnat
- XML in Legislature/Parliament Environments (Sean McGrath) -- quite detailed background on the use of XML in legislation drafting systems, and the problems caused by convention in that world--page/line number citations, in particular. (Quick gloat: NZ's legislature management system is kick-ass, and soon we'll switch from print authoritative to digital authoritative)
- Large-Scale Social Media Analysis with Hadoop -- In this tutorial we will discuss the use of Hadoop for processing large-scale social data sets. We will first cover the map/reduce paradigm in general and subsequently discuss the particulars of Hadoop's implementation. We will then present several use cases for Hadoop in analyzing example data sets, examining the design and implementation of various algorithms with an emphasis on social network analysis. Accompanying data sets and code will be made available. (via atlamp on Delicious) For a toy illustration of the map/reduce shape, see the sketch after this list.
- Breaking Monotony with Meaning; Motivation in Crowdsourcing Markets (Crowdflower) -- This finding has important implications for those who employ labor in crowdsourcing markets. Companies and intermediaries should develop an understanding of what motivates the people who work on tasks. Employers must think beyond monetary incentives and consider how they can reward workers through non-monetary incentives such as by changing how workers perceive their task. Alienated workers are less likely to do work if they don’t know the context of the work they are doing and employers may find they can get more work done for the same wages simply by telling turkers why they are working.
- Balkanizing the Web -- The very absurdity of the global digital system is revealing itself. It created all the instruments for global access and, then, turned around and arbitrarily restricted its commercial use, paving the way for piracy. Think about it: our broadband networks now allow seamless streaming of films, TV shows, music and, soon, of a variety of multimedia products; we have created sophisticated transaction systems; we are getting extraordinary devices to enjoy all this; there is a growing English-speaking population that, for a significant part of it, is solvent and eager to buy this globalized culture and information. But guess what? Instead of a well-crafted, smoothly flowing distribution (and payment) system, we have these Cupertino, Seattle or Los Angeles-engineered restrictions. The U.S. insists on exporting harsh copyright penalties and restrictions, while not exporting license agreements and Fair Use, so the rest of the world gets very grumpy.
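For readers new to the paradigm the Hadoop tutorial covers, here's a single-process Python sketch of map/reduce's shape: a map phase emitting key/value pairs, a shuffle grouping values by key, and a reduce phase aggregating each group. Hadoop's value is distributing exactly these steps across a cluster; nothing below is Hadoop's actual API.

```python
# Word count expressed as map -> shuffle -> reduce, the computation shape
# that Hadoop parallelizes. This single-process version only shows the
# structure, not the distribution.

from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
```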
tags: big data, copyright, Crowdflower, crowdsourcing, gov20, hadoop, social graph, xml
Four short links: 31 May 2010
Data and Context, Twitpic Hot or Not, Failing to Save Journalism, Flash in Javascript
by Nat Torkington | @gnat
- Transparency is Not Enough (danah boyd) -- we need people to not just have access to the data, but have access to the context surrounding the data. A very thoughtful talk from Gov 2.0 Expo about meaningful data release.
- Feed6 -- the latest from Rohit Khare is a sort of a "hot or not" for pictures posted to Twitter. Slightly addictive, while somewhat purposeless. Remarkable for how banal the "most popular" pictures are, it reminds me of the way Digg, Reddit, and other such sites trend towards the uninteresting and dissatisfying. Flickr's interestingness still remains one of the high points of user-curated notability. (via rabble on Twitter)
- Potential Policy Recommendations to Support the Reinvention of Journalism (PDF) -- FTC staff discussion document that floats a number of policy proposals around journalism: additional IP rights to defend against aggregators like Google News; protection of "hot news" facts; statutory limits to "fair use"; antitrust exemptions for cartel paywalls; and more. Jeff Jarvis hates it, but Alexander Howard found something to love in the proposal that the government "maximize the easy accessibility of government information" to help journalists find and investigate stories more easily. (via Jose Antonio Vargas)
- Smokescreen -- a Flash player in Javascript. See Simon Willison's explanation of how it works. Was created by the fantastic Chris Smoak, who was an early Google Maps hacker and built the BusMonster interface to Seattle public transport. (via Simon Willison)
tags: collective intelligence, data, Flash, gov2.0, javascript, journalism, programming, transparency, twitter
Putting Online Privacy in Perspective
by Tim O'Reilly | @timoreilly
When I wrote last week about the Facebook privacy flap, I was speaking out of the frustration that many technologists with a sense of perspective feel when we see uninformed media hysteria about the impact of new technology. (How many of you remember all the scare stories about the risks of using a credit card online from back in the mid-1990s, all of them ignoring the risks that consumers blithely took for granted in the offline world?)
Search engine expert Danny Sullivan vented some of this frustration on a private mailing list the other day. He gave me permission to reprint his remarks here. Danny was responding to a discussion of a Washington Post story about online privacy that started out with concerns about how information posted online is routinely being discovered and used against people in legal cases. (But even then, as you'll see, they left out a crucial part of the story.)
But then, the story goes on to link these cases with the general idea of data collection online.
In the 15 years since the World Wide Web brought the Internet to the masses, the most successful companies have been those that collect information about users and use it to sell things. Google, for instance, has confirmed that it keeps track of search queries sent from a particular IP address. (A spokesman said the company anonymizes IP addresses associated with search queries after nine months and cookies after 18 months.)
The problem with linking these two ideas is that the kind of data in the examples above is exactly the kind of data online companies need to collect in order to manage and improve their services. It is a lot like the data collected by your car -- some of which, like your speed, is reported to you, and much of which is only reported to a mechanic via a diagnostic computer. That this kind of data is collected is not only no surprise to computer professionals, it's taught as basic practice! The story continues:
Companies are loath to talk about what information they track, but internal compliance manuals for law enforcement for Facebook, Yahoo and Microsoft reviewed by The Washington Post show that their data collection is much more extensive than users might believe based on what they themselves can access.
For example: Microsoft tracks the Xbox LIVE start and end dates and times for game-playing and notes the game played, such as "SW: Jedi Academy." Yahoo keeps chat and instant messenger logs for 45 to 60 days and notes the time/date and IP address for when content is added or deleted to someone's profile or to its Flickr photo service.
Facebook's data collection is among the most detailed.
For every user id, Facebook keeps a log of the IP address that accessed the account, the date and time, and what exactly the user did -- clicking on an advertisement, looking at someone else's profile, posting a photo or sending a message to a friend, etc.
Danny was particularly put off by the hysteria about well-known facts, and by the scrutiny given to trivial pieces of online data collection while ignoring far more massive collection of data by more familiar means. He wrote:
Heh. Google has confirmed it tracks queries to a particular IP address. Like this wasn't something we knew for any search engine back in, say, 1995. Or as if Google ever made a secret of it. Or more to the point, like tracking to an IP address is the issue versus the bigger issue of people having search histories (if people opt in) linked to real, personally identifiable accounts.
Heaven help us, though -- let's keep talking IP addresses and cookies. And let's ignore the fact that in virtually every court case where search queries have been notable as evidence, those queries were obtained ... wait for it ... off the person's own computer. Dude, when you're searching for ways to kill your wife, clear your browser history. Seriously, sad but true story.
I think the internet companies are indeed going to face more scrutiny, because they are big fat targets for lazy legislators who are loath to provide some real security over, I dunno, my credit card purchases?
I mean, can you imagine if when using Google and Yahoo and Bing, they reported all your searches to a "search bureau" that was pretty easy for anyone to access? Oh, and if you disagreed with something listed, well, good luck with getting that removed. But we tolerate that bull from our credit card companies.
My credit card company knows everything I've purchased, which is a pretty personal trail. That doesn't get "anonymized" after 9 months or 18 months. I have no idea at all what happens to it. I can't, like at Google, push a button and make it go poof, either. I don't think I have any rights over it at all.
My grocery store knows all the things I've purchased using my store discount card -- no idea who they hand that out to.
My telephone company keeps my phone records for I don't know how long. Imagine that. They know who I called and for how long.
But yeah, thank you Washington Post for focusing on the fact that Xbox Live keeps track of when I began and ended my game playing. Yeah, thanks for spending time talking about IP addresses. Could they have shoved even one paragraph of perspective in there? Could we get one of the privacy groups to maybe call for some better national standards protecting user information on and OFFline? If they are, I never hear the offline part.
Rant over. I've just seen this same obsession with IP addresses over years. Years and years, rather than focusing on the bigger and more important privacy issues from a broader perspective.
There are real privacy issues to be faced in the data collected by web companies. But they are part of a far bigger picture of how the world is changing. We need thoughtful understanding of what the real risks are, not finger pointing by the media (and even more frighteningly, by members of Congress) at companies that are easy targets because they make good political theater.
tags: facebook, google, privacy