Archive: Data
September 6, 2008
Write a Hadoop MapReduce job in any programming language
Hadoop is a Java-based distributed application and storage framework that's designed to run on thousands of commodity machines. You can think of it as an open source approximation of Google's search infrastructure. Yahoo!, in fact, runs many components of its search and ad products on Hadoop, and it's not too surprising that they are a major contributor to the project.
MapReduce is a method for writing software that can be parallelized across thousands of machines to process enormous amounts of data. For instance, let's say you want to count the number of referrals, by domain, in all the world's Apache server logs. Here's the gist of how you'd do it:
- Get all the world to upload their server logs to your gigantor distributed file system. You might automate and approximate this by having every web administrator add some javascript code to their site that causes their visitor's browsers to ping your own server, resulting in one giant log file of all the world's server logs. Your filesystem of choice is HDFS, the Hadoop Distributed Filesystem, which handles partitioning and replicating this enormous file between all of your cluster nodes.
- Split the world's largest log file into tiny pieces, and have your thousands of cluster machines parse the pieces, looking for referrers. This is the "Map" phase. Each chunk is processed and the referrers found in that chunk are output back to the system, which stores the output keyed by the referrer hostname. The chunk assignments are optimized so that the cluster nodes will process chunks of data that happen to be stored on their local fragment of the distributed file system.
- Finally, all the outputs from the Map phase are collated. This is called the "Reduce" phase. The cluster nodes are assigned a hostname key that was created during the Map phase. All of the outputs for that key are read in by the node and counted. The node then outputs a single result which is the domain name of the referrer, and the total number of referrals that were produced from that referrer. This is done hundreds of thousands of times, once for each referrer domain, and distributed across the thousands of cluster nodes.
At the end of this hypothetical MapReduce job, you're left with a concise list of each domain that's referred traffic, and a count of how many referrals it's given. What's cool about Hadoop and MapReduce is that they make writing distributed applications like this surprisingly simple. The two functions to perform the example referrer parsing might only be about 20 lines of code. Hadoop takes care of the immense challenges of distributed storage and processing, letting you focus on your specific task.
Since Hadoop is written in Java, the natural way to create distributed jobs is to encapsulate your Map and Reduce functions in a Java class. If you're not a Java junkie, though, don't worry: there's a job wrapper called HadoopStreaming which can communicate with any program you write over the usual STDIN and STDOUT. This lets you write your distributed job in Perl, Python or even a shell script! You create two programs, one for the mapper and one for the reducer, and HadoopStreaming handles uploading them to all of the cluster nodes and passing data to and from your programs.
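To make that concrete, here's a rough sketch of what a HadoopStreaming mapper and reducer for the referrer-counting example might look like. It's in PHP, but any language that can read STDIN and write STDOUT works the same way. The assumption that the referrer is the eleventh whitespace-separated field of a combined-format log line is mine, so adjust the parsing to match your own logs.

#!/usr/bin/php
<?php
// mapper.php -- reads raw log lines from STDIN, emits "host<TAB>1" for each referrer found
while (($line = fgets(STDIN)) !== false) {
    $fields = preg_split('/\s+/', trim($line));
    if (!isset($fields[10])) continue;                       // assumes combined log format
    $host = parse_url(trim($fields[10], '"'), PHP_URL_HOST); // pull just the referring hostname
    if ($host) echo "$host\t1\n";
}

#!/usr/bin/php
<?php
// reducer.php -- HadoopStreaming delivers the mapper output sorted by key,
// so we can total each referrer as soon as the key changes
$current = null;
$count = 0;
while (($line = fgets(STDIN)) !== false) {
    list($host, $n) = explode("\t", trim($line));
    if ($host !== $current) {
        if ($current !== null) echo "$current\t$count\n";
        $current = $host;
        $count = 0;
    }
    $count += (int)$n;
}
if ($current !== null) echo "$current\t$count\n";

You'd then kick off the job with the streaming jar, roughly: hadoop jar contrib/streaming/hadoop-streaming.jar -input logs -output referrer-counts -mapper mapper.php -reducer reducer.php -file mapper.php -file reducer.php (the jar's exact name and path vary by Hadoop version).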
If you want to play around with this, I really recommend a couple of howtos written by German hacker Michael G. Noll. He put together a walkthrough for getting Hadoop up and running on Ubuntu, and also a nice introduction to writing a MapReduce program using HadoopStreaming (with Python as an example).
Are any Hackszine readers using Hadoop? Let us know what you're doing and point us to more information in the comments.
Hadoop
Running Hadoop On Ubuntu Linux
Writing An Hadoop MapReduce Program In Python
Posted by Jason Striegel | Sep 6, 2008 09:58 PM | Data, Software Engineering
September 5, 2008
Read Excel files in Perl and PHP
Relational databases that speak SQL are the data-storage backbone for most developers. Unfortunately, most of the data that's created outside the control of the technology caste at a typical workplace is in Excel format. Because of this, being able to procedurally read and write Excel documents with a familiar language can open up a whole world of possibilities for automation and data migration.
Assuming you're attempting to read and write standard text (i.e. not binary/graphic) data from Excel worksheets, this is actually fairly doable in PHP and Perl.
A recent article by Mike Diehl at Linux Journal piqued my interest in this. He shows off some of the features of the Spreadsheet::ParseExcel Perl module, which can be used to pull data and even formatting information from cells in an Excel worksheet. Once you have your hands on the data, you can do what you want with it: output it to XML, toss it in a database for subsequent querying, or even convert it into other Excel documents (oh, the shame).
Perl Excel Libraries and Information
Spreadsheet::ParseExcel - Read from Excel 95/97/2000 documents
Spreadsheet::WriteExcel - Write to Excel 97/2000/2002/2003 documents
Linux Journal - Reading Native Excel Files in Perl
There are libraries for dealing with native Excel files in PHP as well. The following two seem to be the only options for binary Excel documents.
PHP Excel Libraries
PHP Excel_Reader - Read Excel 95 and 97 documents
Spreadsheet_Excel_Writer - Write Excel 5.0 documents
Reading and Writing Spreadsheets with PHP
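To give a sense of how reading a binary worksheet looks in PHP, here's a minimal sketch using the Excel_Reader package listed above. The include path and the sheets/cells array layout follow the classic Spreadsheet_Excel_Reader interface as I recall it, so treat them as assumptions and check the package's own documentation.

<?php
require_once 'Excel/reader.php';               // adjust to wherever Excel_Reader is installed
$xls = new Spreadsheet_Excel_Reader();
$xls->read('report.xls');                      // parse the binary .xls file
$sheet = $xls->sheets[0];                      // first worksheet; cells are 1-indexed
for ($row = 1; $row <= $sheet['numRows']; $row++) {
    for ($col = 1; $col <= $sheet['numCols']; $col++) {
        $cell = isset($sheet['cells'][$row][$col]) ? $sheet['cells'][$row][$col] : '';
        echo $cell . "\t";
    }
    echo "\n";
}

From there you can push the rows into MySQL, spit out XML, or whatever the task calls for.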
With the most recent version of Excel, there is an XML file format option that will allow you to read and write data in a worksheet by directly interacting with the saved file's DOM. IBM has a document that details doing this with PHP, and it would be straightforward to apply this technique to Perl as well.
Read/Write XML Excel Data in PHP
Finally, if all you need to do is output a document that can be read in Excel, a standard CSV-format file will usually do the trick. Escaping can be a bit tricky, however, and my preferred format has become a plain-old HTML table. Just create a file that contains a TABLE element (no BODY or HTML tags necessary), with any number of TR rows and html-escaped data in the TDs, and save it out. If you use the XLS file extension, it will open directly in Excel with a double-click and Excel never seems to mind reading in the data.
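Here's roughly what that last trick looks like in PHP; the rows and the report.xls filename are just placeholder data.

<?php
// Build an HTML table that Excel will open happily when the file carries an .xls extension
$rows = array(
    array('Domain', 'Referrals'),
    array('example.com', 1234),
    array('makezine.com', 567),
);
$out = "<table>\n";
foreach ($rows as $row) {
    $out .= "<tr>";
    foreach ($row as $cell) {
        $out .= "<td>" . htmlspecialchars($cell) . "</td>";   // escape the cell data; no BODY or HTML tags needed
    }
    $out .= "</tr>\n";
}
$out .= "</table>\n";
file_put_contents('report.xls', $out);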
Do you have any other Excel programming hacks? Give us a shout in the comments.
Posted by Jason Striegel | Sep 5, 2008 08:23 PM | Data, Excel, PHP, Perl
August 6, 2008
Memcached and high performance MySQL
Memcached is a distributed object caching system that was originally developed to improve the performance of LiveJournal and has subsequently been used as a scaling strategy for a number of high-load sites. It serves as a large, extremely fast hash table that can be spread across many servers and accessed simultaneously from multiple processes. It's designed to be used for almost any back-end caching need, and for high performance web applications, it's a great complement to a database like MySQL.
In a typical environment, a web developer might employ a combination of process-level caching and the built-in MySQL query caching to eke out that extra bit of performance from an application. The problem is that in-process caching is limited to the web process running on a single server. In a load-balanced configuration, each server is maintaining its own cache, limiting the efficiency and available size of the cache. Similarly, MySQL's query cache is limited to the server that the MySQL process is running on. The query cache is also limited in that it can only cache row results. With memcached you can set up a number of cache servers which can store any type of serialized object, and this data can be shared by all of the load-balanced web servers. Cool, no?
To set up a memcached server, you simply download the daemon and run it with a few parameters. From the memcached web site:
First, you start up the memcached daemon on as many spare machines as you have. The daemon has no configuration file, just a few command line options, only 3 or 4 of which you'll likely use:
# ./memcached -d -m 2048 -l 10.0.0.40 -p 11211
This starts memcached up as a daemon, using 2GB of memory, and listening on IP 10.0.0.40, port 11211. Because a 32-bit process can only address 4GB of virtual memory (usually significantly less, depending on your operating system), if you have a 32-bit server with 4-64GB of memory using PAE you can just run multiple processes on the machine, each using 2 or 3GB of memory.
It's about as simple as it gets. There's no real configuration. No authentication. It's just a gigantor hash table. Obviously, you'd set this up on a private, non-addressable network. From there, the work of querying and updating the cache is completely up to the application designer. You are afforded the basic functions of set, get, and delete. Here's a simple example in PHP:
$memcache = new Memcache;
$memcache->addServer('10.0.0.40', 11211);
$memcache->addServer('10.0.0.41', 11211);
$value = "Data to cache";
$memcache->set('thekey', $value, 0, 60);   // third argument is flags; fourth is the expiry in seconds
echo "Caching for 60 seconds: $value <br>\n";
$retrieved = $memcache->get('thekey');
echo "Retrieved: $retrieved <br>\n";
The PHP library takes care of the dirty work of serializing any value you pass to the cache, so you can send and retrieve arrays or even complete data objects.
In your application's data layer, instead of immediately hitting the database, you can now query memcached first. If the item is found, there's no need to hit the database and assemble the data object. If the key is not found, you select the relevant data from the database and store the derived object in the cache. Similarly, you update the cache whenever your data object is altered and updated in the database. Assuming your API is structured well, only a few edits need to be made to dramatically alter the scalability and performance of your application.
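In code, that read-through pattern is only a few lines. Here's a rough sketch using the same PHP Memcache extension as the example above; load_user_from_db() and the 'user:' key prefix are hypothetical stand-ins for your own data layer.

<?php
function get_user($memcache, $db, $user_id) {
    $key = 'user:' . $user_id;
    $user = $memcache->get($key);
    if ($user !== false) {
        return $user;                             // cache hit: no database work at all
    }
    $user = load_user_from_db($db, $user_id);     // cache miss: assemble the object from MySQL
    $memcache->set($key, $user, 0, 300);          // keep it around for five minutes
    return $user;
}

The write path is the mirror image: whenever the object changes in the database, call set() again (or delete() and let the next read repopulate the cache).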
I've linked to a few resources below where you can find more information on using memcached in your application. In addition to the documentation on the memcached web site, Todd Hoff has compiled a list of articles on memcached and summarized several memcached performance techniques. It's a pretty versatile tool. For those of you who've used memcached, give us a holler in the comments and share your tips and tricks.
Memcached
Strategies for Using Memcached and MySQL Better Together
Memcached and MySQL tutorial (PDF)
Posted by Jason Striegel | Aug 6, 2008 10:37 PM | Data, Linux, Linux Server, MySQL, Software Engineering
August 4, 2008
Shield your files with Reed-Solomon codes
Thanassis Tsiodras wrote in about a utility for adding additional error correction redundancy to your backup data:
The way storage quality has been nose-diving in the last years, you'll inevitably end up losing data because of bad sectors. Backing up, using RAID and version control repositories are some of the methods used to cope; here's another that can help prevent data loss from bad sectors. It is a software-only method, and it has saved me from a lot of grief.
The technique uses Reed-Solomon coding to add additional parity bytes to your data. If you suffer partial damage to the storage media, these files can still be recoverable.
Storage media are of course block devices, that work or fail on 512-byte sector boundaries (for hard disks and floppies, at least - in CDs and DVDs the sector size is 2048 bytes). This is why the shielded stream must be interleaved every N bytes (that is, the encoded bytes must be placed in the shielded file at offsets 1,N,2N,...,2,2+N,etc): In this way, 512 shielded blocks pass through each sector (for 512 byte sectors), and if a sector becomes defective, only one byte is lost in each of the shielded 255-byte blocks that pass through this sector. The algorithm can handle 16 of those errors, so data will only be lost if sector i, sector i+N, sector i+2N, ... up to sector i+15N are lost! Taking into account the fact that sector errors are local events (in terms of storage space), chances are quite high that the file will be completely recovered, even if a large number of sectors (in this implementation: up to 127 consecutive ones) are lost.
The application works much like any other command line archiving utility, so you can tar your files as normal and then send them to the freeze.sh script. Running melt.sh on the archive will return your original data, even if there was a reasonable amount of corruption to the file. Thanks, Thanassis!
Hardening your files with Reed-Solomon codes
Posted by Jason Striegel | Aug 4, 2008 10:04 PM | Data, Linux
July 23, 2008
NTFS Alternate Data Streams - hide files inside other files
The NTFS file system has support for additional data, called Alternate Data Streams (ADS), to be attached to any file. Normally this is used by the operating system and file explorer to bind extra data to a file, such as the file's access control information, searchable file meta-data like keywords, comments and revision history, and even information that can mark a file as having been downloaded from the internet. Because this extra information is bound to the file at the filesystem level, you can move the file from one folder to another and all of the various meta-information and permission data stays with the file.
The interesting thing is that any file or directory can have zero to many ADS forks attached to it. While some of the ADS identifiers are used by the OS, there's nothing stopping you from adding other ADS forks to a file. You can do this directly from the command line, using a simple colon ":" notation.
Let's say you have a file called test.txt. You can store a secret message in the file like this:
echo "This is a secret" > test.txt:secretdata
If you view the contents of the file, you won't see anything peculiar. If you know about the existence of the secretdata ADS entry, however, you can easily extract the hidden information with the following command:
more < test.txt:secretdata > output.txt
When you now open output.txt, you'll find your secret data inside.
Because it's a lower-level OS feature, you can even trick most programs into loading the data. In the scenario above, you could actually load and edit the secretdata stream inside of Notepad by running "notepad test.txt:secretdata".
You can even store and execute binary data of any size in an ADS fork. For instance, maybe you want to shove Solitaire inside one of your text file's ADS entries:
type c:\winnt\system32\sol.exe > test.txt:timewaster.exe
Running the file is as simple as "start .\test.txt:timewaster.exe". Wild, no?
So the odd thing is that all these hidden streams are floating about your filesystem, and until Vista added the /R flag to the DIR command, there wasn't really a good built-in way of detecting them. To solve this, Frank Heyne created an application called LADS, an excellent command line utility that will scan a directory and print out stream names and sizes for the files within it.
There was also a tool released in an MSDN article about file streams that will add an extra tab to the file properties dialog in Windows Explorer. I've linked to a FAQ that Frank maintains about ADS that walks you through setting up the dll and registry entries to make this work. When it's activated, the Streams tab in the properties panel will let you create, view, edit or delete the stream data that's attached to any file, right in Explorer.
I can see how this file system feature could be useful, but it's a little odd that it's so hidden from the user and there seem to be a few problems with the concept. Obviously, because of ADS's hidden nature, there are a number of malicious uses that can be employed by jerk-o's who write virii and that sort of thing. Even ignoring that, there are also data interchange issues—moving a file between NTFS and another file system causes the loss of all this attached information. Call me old fashioned, but I like my files the way they used to be, with a start, an end, and some bytes in between.
Frank Heyne - Alternate Data Streams in NTFS FAQ
LADS - NTFS alternate data stream list utility
The Dark Side of NTFS
MSDN: A Programmer's Perspective on NTFS Streams and Hard Links
Posted by Jason Striegel | Jul 23, 2008 10:30 PM | Cryptography, Data, Windows, Windows Server
July 15, 2008
When to denormalize
There's been a bit of a database religious war on Dare Obasanjo and Jeff Atwood's blogs, all on the subject of database normalization: when to normalize, when not to, and the performance and data integrity issues that underlie the decision.
Here's the root of the argument. What we've all been taught regarding database design is irrelevant if the design can't deliver the necessary performance results.
The 3rd normal form helps to ensure that the relationships in your DB reflect reality, that you don't have duplicate data, that the zero-to-many relationships in your system can accommodate any potential scenario, and that space isn't wasted and reserved for data that isn't explicitly being used. The downside is that a single object within the system may span many tables and, as your dataset grows large, the joins and/or multiple selects required to extract entities from the system begin to impact the system's performance.
By denormalizing, you can compromise and pull some of those relationships back into the parent table. You might decide, for instance, that a user can have only 3 phone numbers, 1 work address, and 1 home address. In doing so, you've met the requirements of the common scenario and removed the need to join to separate address or contact number tables. This isn't an uncommon compromise. Just look at the contacts table in your average cell phone to see it in action.
Jeff writes:
Both solutions have their pros and cons. So let me put the question to you: which is better -- a normalized database, or a denormalized database? Trick question! The answer is that it doesn't matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n.
So for large n, what's the solution? In my personal experience, you can usually have it both ways.
Design your database to 3NF from the beginning to ensure data integrity and to allow room for growth, additional relationships, and the sanity of future querying and indexing. Only when you find there are performance problems do you need to think about optimizing. Usually this can be accomplished through smarter querying. When it cannot, you derive a denormalized data set from the normalized source. This can be as simple as an extra field in the parent table that derives sort information on inserts, or it can be a full-blown object cache table that's updated from the official source at some regular interval or when an important event occurs.
Read the discussions and share your comments. To me, the big takeaway is that there's no one solution that will fit every real world problem. Ultimately, your final design has to reflect the unique needs of the problem that is being solved.
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
Posted by Jason Striegel | Jul 15, 2008 08:47 PM | Data, Software Engineering
July 5, 2008
Crawling AJAX
Traditionally, a web spider system is tasked with connecting to a server, pulling down the HTML document, scanning the document for anchor links to other HTTP URLs and repeating the same process on all of the discovered URLs. Each URL represents a different state of the traditional web site. In an AJAX application, much of the page content isn't contained in the HTML document, but is dynamically inserted by Javascript during page load. Furthermore, anchor links can trigger javascript events instead of pointing to other documents. The state of the application is defined by the series of Javascript events that were triggered after page load. The result is that the traditional spider is only able to see a small fraction of the site's content and is unable to index any of the application's state information.
So how do we go about fixing the problem?
Crawl AJAX Like A Human Would
To crawl AJAX, the spider needs to understand more about a page than just its HTML. It needs to be able to understand the structure of the document as well as the Javascript that manipulates it. To be able to investigate the deeper state of an application, the crawling process also needs to be able to recognize and execute events within the document to simulate the paths that might be taken by a real user.
Shreeraj Shah's paper, Crawling Ajax-driven Web 2.0 Applications, does a nice job of describing the "event-driven" approach to web crawling. It's about creating a smarter class of web crawling software which is able to retrieve, execute, and parse dynamic, Javascript-driven DOM content, much like a human would operate a full-featured web browser.
The "protocol-driven" approach does not work when the crawler comes across an Ajax embedded page. This is because all target resources are part of JavaScript code and are embedded in the DOM context. It is important to both understand and trigger this DOM-based activity. In the process, this has lead to another approach called "event-driven" crawling. It has following three key components
- Javascript analysis and interpretation with linking to Ajax
- DOM event handling and dispatching
- Dynamic DOM content extraction
The Necessary Tools
The easiest way to implement an AJAX-enabled, event-driven crawler is to use a modern browser as the underlying platform. There are a couple of tools available, namely Watir and Crowbar, that let you control Firefox or IE from code and extract page data after it has processed any Javascript.
Watir is a library that enables browser automation using Ruby. It was originally built for IE, but it's been ported to both Firefox and Safari as well. The Watir API allows you to launch a browser process and then directly extract and click on anchor links from your Ruby application. This application alone makes me want to get more familiar with Ruby.
Crowbar is another interesting tool which uses a headless version of Firefox to render and parse web content. What's cool is that it provides a web server interface to the browser, so you can issue simple GET or POST requests from any language and then scrape the results as needed. This lets you interact with the browser from even simple command line scripts, using curl or wget.
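As a quick illustration, here's a sketch of a PHP script leaning on a locally running Crowbar instance; the port and the url/delay query parameters follow Crowbar's defaults as I recall them, so double-check them against the Crowbar documentation.

<?php
// Ask the local Crowbar server to load the page, run its Javascript, and return the resulting DOM
$target  = 'http://example.com/some-ajax-page';
$crowbar = 'http://127.0.0.1:10000/?url=' . urlencode($target) . '&delay=3000';   // give scripts 3 seconds to run
$rendered = file_get_contents($crowbar);
if ($rendered === false) {
    die("Couldn't reach Crowbar -- is it running?\n");
}
// From here the rendered markup can be scraped like any static page
if (preg_match_all('/<a[^>]+href="([^"]+)"/i', $rendered, $matches)) {
    print_r($matches[1]);
}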
Which tool you use depends on the needs of your crawler. Crowbar has the benefit of being language agnostic and simple to integrate into a traditional crawler design to extract page information that would only be present after a page has completed loading. Watir, on the other hand, gives you deeper, interactive access to the browser, allowing you to trigger subsequent Javascript events. The downside is that the logic behind a crawler that can dig deep into application state is quite a bit more complicated, and with Watir you are tied to Ruby which may or may not be your cup of tea.
Crowbar - server-side headless Firefox
Watir - browser remote control in Ruby
Crawling Ajax-driven Web 2.0 Applications (PDF)
Posted by Jason Striegel | Jul 5, 2008 12:57 PM | Ajax, Data, Web
June 24, 2008
Videos from past Shmoocons
You may have dug the videos of past DEFCON conferences that we posted back in May, but there's a whole other infosec conference, Shmoocon, which is held in D.C. every February.
ShmooCon is an annual East coast hacker convention hell-bent on offering three days of an interesting atmosphere for demonstrating technology exploitation, inventive software & hardware solutions, and open discussions of critical infosec issues.
It's a while until the next conference comes up, but there have been some great presentations at past conferences, most of which are available online. Peteris Krumins recently assembled links to all of the videos and presentation files that are available at the Shmoocon site (including the 2008 conference), posting them to his blog as a single big index.
A quick search on YouTube also turned up a series of videos by Scott Moulton from Shmoocon 2007 and 2008 on the topic of data recovery for both traditional hard disks and flash drives. It's pretty fascinating stuff, whether you're interested in this from a forensics or security perspective, or if you've ever just wondered what exactly goes into recovering important data from a crashed disk when you send it out to a data recovery shop.
Hacking Videos from Shmoocon
Scott Moulton's videos on data recovery for SSD flash drives and hard disks
Shmoocon Infosec Conference
See also: Videos from past DEFCONs
Posted by Jason Striegel | Jun 24, 2008 09:14 PM | Cryptography, Data, Network Security
May 13, 2008
drop.io - simple anonymous file sharing
Sometimes I need to send someone a file that's too large to attach to an email. Inevitably, the solution is to upload it to an ftp or web server that I have access to and then send the recipient a download url. It's a pretty inefficient process, and unless you like your ftp server becoming an overwhelming mess of random downloads, you have to remember to go back and remove things at a later date.
drop.io is a web service that solves this sort of problem perfectly. You create a drop URL with a unique name, upload a file to it, and set an expiration time when it will be deleted, all in a single step. The drop folder can have both an access and an admin password, and you can choose what level of access (read, read/write, read/write/delete) the non-admin has. After you've created a drop folder, you can continue to add files and notes to it via the web interface or by email. Each drop also has a phone extension that will allow you to call in and record messages that are added to the drop. It's brilliantly simple.
What I like best is that aside from tracking IP for legal or terms of service violations, it's completely anonymous. You don't make an account to use the service. There is no profile. The drop folders aren't search indexable unless you choose to make them without passwords and publish the URL somewhere crawlable. You can renew the expiration period of the drop, but when it expires, it goes away along with its contents.
I like.
drop.io - Simple Private Exchange
Posted by Jason Striegel | May 13, 2008 08:25 PM | Data
May 9, 2008
Processing.js - visualization library for Javascript
John Resig, of jQuery fame, released a port of the Processing visualization language for Javascript. Seriously, John is on fire:
The first portion of the project was writing a parser to dynamically convert code written in the Processing language, to JavaScript. This involves a lot of gnarly regular expressions chewing up the code, spitting it out in a format that the browser understands. It works "fairly well" (in that it's able to handle anything that the processing.org web site throws at it) but I'm sure its total scope is limited (until a proper parser is involved). I felt bad about tackling this using regular expressions until I found out that the original Processing code base did it in the same manner (they now use a real parser, naturally).
The full 2D API is implemented, with the exclusion of some features here and there between browsers (Firefox 3 is pretty full featured). You can interact with the Processing API directly from standard Javascript. This lets you make use of these drawing features by simply instantiating a Processing object, and then calling its various drawing methods.
Another capability is to write code natively in the Processing language. This allows you to make use of extended language features such as method overloading and classic inheritance, though it looks like type information is pretty much ignored.
John has many of the demos from processing.org working. Most of them are going to peg your CPU, but this is some seriously cool stuff to see working in a first release.
Javascript just got a lot more interesting.
Processing.js
Processing: open source data visualization language
Posted by Jason Striegel | May 9, 2008 09:36 PM | Ajax, Data, Firefox, Software Engineering, Web
March 17, 2008
CryoPID: hibernation for Linux processes
We're all familiar with the hibernate/deep-sleep features that are typical on your standard laptop. In this mode, the entire contents of RAM are written to the disk and the machine is completely shut down. When it's next booted, the system is restored to the exact state it was in before sleep, with all of your programs running just like they were when you left them.
What if you could do this at the process level? You could kill whatever umpteen-gazillion applications you have running, reboot your computer, and then start your apps back up whenever you like and they would be exactly the way they were when you left them.
There's a Linux application called CryoPID which attempts to do just that.
CryoPID requires no special kernel modifications and operates in user mode, so you don't need to be root. All you do is run the freeze program on a process you own:
freeze /tmp/savestatefile 1234
This will archive the state of process 1234 into a self-executing, compressed file named /tmp/savestatefile. To start it back up, just run the save file:
/tmp/savestatefile
When this is executed, your application will be restored, relinked to any previously loaded shared libraries, and reattached to the file descriptors it had open.
You'll run into some problems with network socket connections you had open, and support for X applications is still only experimental, so the useful scenario is a bit limited, but it's a promising concept and could come in quite handy in the command-line world.
CryoPID - A Process Freezer for Linux
Posted by Jason Striegel | Mar 17, 2008 09:32 PM | Data, Linux
March 4, 2008
Ram dump over Firewire
Unlike USB2, the Firewire spec allows devices to have full DMA access. By impersonating the appropriate device, a PC can essentially obtain full read/write access to another machine's RAM, just by connecting the two machines with a Firewire cable. Adding to the recent discussion about the insecurities of physical access and Princeton's cold-boot RAM dump demonstration, Adam Boileau released a Linux Firewire utility that will give you immediate Administrator access to an XP machine:
It's two years later, and I think anyone who was going to get the message about Firewire has already got it, and anyone who was going to be upset about it has got over it. Besides, according to Microsoft's definition, it never was a Security Vulnerability anyway - screensavers and login prompts are - as Bruce says - about the Feeling of Security. Anyway, today's release day for Winlockpwn, the tool I demoed at Ruxcon for bypassing windows auth, or popping an admin shell at the login window....
- Yes, you can read and write main memory over firewire on windows.
- Yes, this means you can completely own any box who's firewire port you can plug into in seconds.
- Yes, it requires physical access. People with physical access win in lots of ways. Sure, this is fast and easy, but it's just one of many.
- Yes, it's a FEATURE, not a bug. It's the Fire in Firewire. Yes, I know this, Microsoft know this. The OHCI-1394 spec knows this. People with firewire ports generally dont.
Adam's tools include a few Python apps that can copy and impersonate Firewire device signatures, dump RAM on a remote machine, bypass Windows authentication, and extract BIOS passwords. It's not exactly comforting, but I've got a new appreciation for Firewire now. This is the sort of access that used to only be possible by creating hardware that physically connects to the PCI bus. Now all you need is a cable and a laptop.
Firewire, DMA & Windows - direct memory access over Firewire - [via] Link
Posted by Jason Striegel | Mar 4, 2008 07:08 PM | Cryptography, Data, Linux, Network Security, Windows
March 1, 2008
Recover data from RAM after a crash
After Princeton's cold-boot encryption key recovery hack, I got to thinking about what other useful things might be lying around in memory. It's old news that passwords of logged-in users are hanging out in there, but what about something more useful to the everyday user? What about that file you were editing before accidentally closing its window without saving?
In Linux and on PPC Macs, the root user can access the machine's ram through the /dev/mem device. I'm not sure why this is unavailable on newer Intel Macs—it's a bummer.
In theory, if you're processing some words, spreading sheets, or posting a blog entry and your program crashes, it's likely that the data you were editing will still be in RAM, unharmed, waiting to be allocated to another process. If you immediately dump the entire contents of RAM to disk before starting another large process, chances are good you can find your data again. It's tricky though—writing that RAM to disk requires you to start up at least one process, such as dd. It's possible that this new process, or another process that's currently running, could allocate memory and obliterate your file. You don't really have other options, though, so you might try something like this:
dd if=/dev/mem of=/tmp/ramdump
strings /tmp/ramdump | grep "some text in your file"
I found a post by David Keech where he describes exactly this process. He was able to use it to successfully recover the text from a killed vi session:
I tested this by starting vi and typing in "thisisanabsolutelyuniqueteststring", killing the vi process without saving the file and running the command above immediately with a small modification. Instead of piping the output to a file, I piped it to grep thisisanabsolutelyuniquetest. The grep command found itself, as it always does, but it also found the original string, identified by the rest of the unique string that I didn't include in the grep command. You have to be careful when searching through running memory. I now remember having this problem with the Mac all those years ago. Whenever I searched for parts of my brother's letter, I would just end up finding the part of memory that contained the search string.
He also mentions scanning the swap partition, which is also a likely place for your data to be found. It's the same process, but you replace /dev/mem with /dev/hda2 or whatever your swap partition is.
Here's the fun part. Based on what we now know about DRAM holding data even after a few seconds of being unpowered, you might even be able to use this method to recover program data after a full system crash and reboot. The swap data will for sure be there, but if you reboot into single user mode without starting up X or any large applications, the possibility exists that unallocated areas of /dev/mem will still contain data from before the reboot.
How to recover your data after a crash - Link
Extracting encryption keys after a cold boot - Link
Posted by Jason Striegel | Mar 1, 2008 10:16 AM | Data, Linux
February 6, 2008
TrueCrypt for OS X
TrueCrypt 5.0 was released yesterday and OS X has been added to the list of supported operating systems, making it the only open source volume encryption utility that works in Linux, Mac and Windows. It's a really slick utility for creating an AES-256 or Serpent encrypted volume that you can drop sensitive files inside.
You can use TrueCrypt to create an encrypted volume image inside a file, or you can encrypt a whole disk image or partition. The OS X version uses MacFUSE to provide user-mode mounting of the encrypted disk. The main application window, pictured above, gives you a simple interface for creating and mounting encrypted images.
Once an image is mounted, you can use it like a normal hard disk. Unmount the disk and you're left with a file full of random gibberish. FAT is the only filesystem that's available through the interface, but once the disk is mounted, you can reformat it with Disk Utility to use HFS+.
There are a couple of things worth noting. In the Windows and Linux versions a special bootloader is available that lets you encrypt your entire system drive. It doesn't look like that option is available in the OS X version. Also, when I tested the latest OS X binary this evening, the "hidden volume" plausible deniability feature wasn't working. Hopefully that will be added in a future release. Until then, TrueCrypt is better suited for storing tax documents and things you wouldn't want visible to a laptop thief, rather than the details of where you've hidden the bodies.
Posted by Jason Striegel | Feb 6, 2008 08:34 PM | Cryptography, Data, Mac
December 7, 2007
Hacking the Western Digital MyBook World Edition
Western Digital sells a number of external drives under the MyBook World Edition brand. These are network-based external storage drives that you can connect to remotely from multiple machines. Inside are a couple of drives set up in a mirrored RAID configuration, as well as an embedded computer running Linux.
MakeFan tipped us off to Martin Hinner's website, which has a lot of details about the software running on the MyBooks, including info for hacking the device's capabilities to do more than what's available out of the box.
This page provides information on how to hack your MyBook World Edition, so as you can improve performance and add new features. MyBook is powered by ARM9 microprocessor, it has 32MB of SDRAM and boots from internal hard drive. The system partition has 2.8GB (only 260 MB is occupied). This means that you have a lot of resources for various improvements.
You can enable SSH on the device without cracking the case. Martin hosts a script that subverts the firmware update software to create your SSH keys and start the sshd process. Once that is enabled, you have full access to the OS to do what you like, including running an NFS server, web server, or even replacing the standard web interface.
Also worth checking out is the Hacking WD MyBook Wiki. They have links to information on rescuing data from dead drives and building other software for the device. Keep in mind that building MySQL from source will take about 18 hours, but there's got to be something fun you can do with a LAMP stack running on a terabyte hard drive.
Hacking Western Digital MyBook World Edition - Link
MyBook World Edition Wiki - Link
Posted by Jason Striegel | Dec 7, 2007 08:44 PM | Data