How signals, geometry, and topology are influencing data science

Areas of mathematics concerned with shapes, invariants, and dynamics in high dimensions are proving useful in data analysis

I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology, differential geometry, and algebraic geometry aren’t exactly areas you associate with data science. But on further reflection, perhaps it shouldn’t be so surprising that areas that deal in shapes, invariants, and dynamics in high dimensions would have something to contribute to the analysis of large data sets. Without further ado, here are a few examples that stood out for me. (If you know of other examples of recent applications of math in data analysis, please share them in the comments.)

Compressed Sensing
Compressed sensing is a signal processing technique that makes efficient data collection possible. As an example, using compressed sensing, images can be reconstructed from small amounts of data. Idealized sampling is used to collect information about the most important components of a signal. By vastly decreasing the number of measurements to be collected, less data needs to be stored, and one reduces the amount of time and energy needed to collect signals. There have already been applications in medical imaging and mobile phones.

The problem is that you don’t know ahead of time which signals/components are important. A series of numerical experiments led Emmanuel Candès to believe that random samples may be the answer. The theoretical foundations for why a random set of measurements would work were laid down in a series of papers by Candès and Fields Medalist Terence Tao.
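
To make this concrete, here is a minimal sketch in Python (using NumPy and scikit-learn; the signal, sizes, and regularization strength are made-up illustrations, not anything from the papers). A sparse signal of length 400 is recovered from just 100 random measurements via L1-regularized regression, the convex relaxation at the heart of compressed sensing:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical sizes: a length-400 signal with only 10 nonzero components,
# observed through 100 random linear measurements (m << n).
n, k, m = 400, 10, 100
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)  # random measurement matrix
y = A @ x                                     # the compressed measurements

# L1-regularized least squares (basis pursuit denoising) favors sparse
# solutions, so it can recover x from far fewer measurements than n.
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50_000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("relative recovery error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```

The random measurement matrix is what makes this work: with high probability it preserves the geometry of all sparse signals, which is exactly the property Candès and Tao’s papers establish.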

Read more…

Strata Week: Intel wants you to reap the benefits from your personal data

Intel's Data Economy Initiative, your personal records are exposed, Sears gets into the data center business, and ODI wants Git for data publishing.

Intel’s taking the lead in the new “data economy”

Intel is looking to take the lead in what it has dubbed the “data economy,” helping consumers and individuals realize and retain more value from their personal data. Antonio Regalado and Jessica Leber report at MIT Technology Review that the world’s largest computer chip maker has launched a “Data Economy Initiative.” Ken Anderson, a cultural anthropologist who is in charge of the project, described the initiative to them as “a multiyear study whose goal is to explore new uses of technology that might let people benefit more directly, and in new ways, from their own data.”

As part of the initiative, Intel is funding hackathons to encourage developers to experiment with personal data in new ways, Regalado and Leber note. “[Intel] has also paid for a rebellious-sounding website called We the Data,” they report, “featuring raised fists and stories comparing Facebook to Exxon Mobil.”

Read more…

Visualization of the Week: CIA rendition flights of terror suspects

The Rendition Project has published an interactive visualization of three years’ worth of suspected rendition flights.

The Rendition Project, a collaboration between academics at Kent and Kingston universities and the NGO Reprieve, has developed an interactive visualization of the extent of CIA rendition flights of terror suspects.

Read more…

The elusive quest to transform healthcare through patient empowerment

We need to provide data to patients in a form they can understand

Would you take a morning off from work to discuss health care costs and consumer empowerment in health care? Over a hundred people in the Boston area did so on Monday, May 6, for the conference “Empowering Healthcare Consumers: A Community Conversation Conference” at the Suffolk Law School. This fast-paced and wide-ranging conference lasted just long enough to show that hopes of empowering patients and cutting health care costs (the real agenda of most of the conference organizers) run up against formidable hurdles, many involving the provision of data to these consumers.

Read more…

Looking ahead to a world of data-dominated decisions

Review of Mayer-Schönberger and Cukier's Big Data

A world-shaking trend with feet planted in every area of human endeavor cannot be fully measured in a popular book of 200 pages, but one has to start somewhere. I am happy to recommend the adept efforts of Viktor Mayer-Schönberger and Kenneth Cukier as a starting point. Their recent book Big Data: A Revolution That Will Transform How We Live, Work, and Think (recently featured in a video interview on the O’Reilly Strata site) does not quite unravel the mystery of the zeal for recording and measurement that is taking over governments and business, but it does what a good popularization should: alert us to what’s happening, provide some frameworks for talking about it, and provide a launchpad for us to debate the movement’s good and evil.

Because readers of this blog have been grappling with these concerns for some time, I’ll provide only the barest summary of the topics covered in Mayer-Schönberger and Cukier’s extensive overview, then offer some complementary ideas of my own.
Read more…

Six disruptive possibilities from big data

Specific ways big data will inundate vendors and customers.

My new book, Disruptive Possibilities: How Big Data Changes Everything, is derived directly from my experience as a performance and platform architect in the old enterprise world and the new, Internet-scale world.

I pre-date the Hadoop crew at Yahoo!, but I intimately understood the grid engineering that made Hadoop possible. For years, the working title of this book was The Art and Craft of Platform Engineering, and when I started working on Hadoop after a stint in the Red Hat kernel group, many of the ideas that had been jammed into my head, going back to my experience with early supercomputers, all seemed to make perfect sense for Hadoop. This is why I frequently refer to big data as “commercial supercomputing.”

In Disruptive Possibilities, I discuss the implications of the big data ecosystem over the next few years. These implications will inundate vendors and customers in a number of ways, including:

Read more…

Improving options for unlocking your graph data

Graph data is an area that has attracted many enthusiastic entrepreneurs and developers

The popular open source project GraphLab received a major boost early this week when a new company formed by its founding developers raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development and allow it to hire staff dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains, including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. In the past, those tools were a bit too complex for casual users, which meant graph data analytics was the province of specialists. Fortunately, graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved, and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming GraphLab Workshop (on July 1st in San Francisco).

Data wrangling: creating graphs
Before you can take advantage of the other tools mentioned in this post, you’ll need to turn your data (e.g., web pages) into graphs; a toy sketch of this step follows. GraphBuilder is an open source project from Intel that uses Hadoop MapReduce to build graphs out of large data sets. Another option is the combination of GraphX/Spark described below. (A startup called Trifacta is building a general-purpose data wrangling tool that could help as well.)
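
As a toy illustration of that wrangling step (plain Python with networkx, not GraphBuilder itself; the pages here are made up), this is how a handful of web pages become a link graph you can then analyze:

```python
import re
import networkx as nx

# Hypothetical crawl results: URL -> HTML body.
pages = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="c.html">C</a>',
    "c.html": '<a href="a.html">A</a>',
}

G = nx.DiGraph()
for url, html in pages.items():
    for target in re.findall(r'href="([^"]+)"', html):
        G.add_edge(url, target)  # one directed edge per hyperlink

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print(nx.pagerank(G))            # once the graph exists, analytics can begin
```

GraphBuilder’s contribution is doing exactly this kind of extract-and-edge-build step with MapReduce, so it scales to data sets far too large for a single machine.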

Read more…

Strata Week: Are customized Google maps a neutrality win or the next “filter bubble”?

Two views on new Google Maps; a look at predictive, intelligent apps; and Aaron Swartz's and Kevin Poulsen's anonymous inbox launches.

Google aims for a new level of map customization

Google introduced a new version of Google Maps at Google I/O this week that learns from each use, customizing itself to individual users and adapting based on their clicks and searches. A post on the Google blog outlines the updates, which include recommendations for places you might enjoy (based on your map activity), ratings and reviews, integrated Google Earth, and tours generated from user photos, to name a few.

Read more…

On becoming a code artist

An interview with Scott Murray, author of Interactive Data Visualization for the Web

Scott Murray, a code artist, has written Interactive Data Visualization for the Web for nonprogrammers. In this interview, Scott provides some insights into what inspired him to write an introduction to D3 for artists, graphic designers, journalists, researchers, or anyone who is looking to begin programming data visualizations.

What inspired you to become a code artist?

Scott Murray: I had designed websites for a long time, but several years ago I was frustrated by web browsers’ limitations. I went back to school for an MFA to force myself to explore interactive options beyond the browser. At MassArt, I was introduced to Processing, the free programming environment for artists. It opened up a whole new world of programmatic means of manipulating and interacting with data — and not just traditional data sets, but also live “data,” such as from input devices or dynamic APIs, which can then be used to manipulate the output. Processing let me start prototyping ideas immediately; it is so enjoyable to be able to build something that really works, rather than designing static mockups first and then, hopefully, one day investing the time to program them. Something about that shift in process is both empowering and liberating — being able to express your ideas quickly in code, and watch the system carry out your instructions, ultimately creating images and experiences that are beyond what you had originally envisioned.

Read more…

Visualization of the Week: Real-time Wikipedia edits

The Wikipedia Recent Changes Map visualizes Wikipedia edits around the world in real time.

Stephen LaPorte and Mahmoud Hashemi have put together an addictive visualization of real-time edits on Wikipedia, mapped across the world. Every time an edit is made, the user’s location and the entry they edited are listed along with a corresponding dot on the map.
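
If you want to tap a similar firehose yourself, here is a rough sketch in Python against Wikimedia’s public EventStreams endpoint, which serves recent changes as server-sent events (this is one way to consume the feed, not the code behind the map itself, which listens to Wikipedia’s live update feed):

```python
import json
import requests

# Wikimedia's EventStreams endpoint emits every recent change as a
# server-sent event; each "data:" line carries one edit as JSON.
URL = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(URL, stream=True) as resp:
    for raw in resp.iter_lines():
        if raw.startswith(b"data: "):
            change = json.loads(raw[len(b"data: "):])
            print(change.get("wiki"), "-", change.get("title"))
```

Each event includes the wiki and the edited entry; the map adds a geolocation step for anonymous editors (whose IP addresses are public) to place each dot.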

Read more…
