Monthly Archives: February 2014

MapReduce for Clearing Clutter

My desk is cluttered.  Some would call it a train wreck.  Some might even feel terrible about it being so cluttered.  I am one of those people.  But I’ve let it slide because of the priorities in my life: family, work, and personal health take precedence over battling a chaotic desk.

Of course, everything on it is there because I didn’t have time to deal with it in the first place.  But as I have transitioned out of school and into the working world, my life has become more routine, with more free time. And my desk has been taunting me.  It calls me names when I walk by, and earlier this week it started a war when it tried to dump a kitchen knife prototype on my foot.  The line had been crossed.

I dove right into the problem in ‘Naive Desk Clearing’ mode and soon felt overwhelmed.  I needed a strategy, and in a flash I decided my giant cluttered desk was a clustering problem.  Before me lay a giant pile of unstructured data.  There were distinct categories of stuff, each of which required a different thought process to deal with.  So trying to just iterate through the pile would have me context switching with each Desk Object, and thus wasting lots of time. And since I’ve been working with Hadoop at work, it seemed like an interesting way to tackle this real-world problem. As they say, if all you have is a hammer, every problem looks like a nail.

Abstracting a monoid from a sea of random stuff on a desk is tough, but I got to the clustering view by thinking of the Desk Objects’ attributes, not the Objects themselves, as the things being processed.  Attributes -> Features -> Feature Sets -> Vectors -> K-means clusters. With my mind in feature set mode, it was time to do some mapping.
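That pipeline can be sketched in a few lines of Python. Everything here is invented for illustration: the desk objects, the three attribute scores, and the tiny hand-rolled k-means (a real job would use a library, but the idea is the same).

```python
# Toy sketch of the Attributes -> Features -> Vectors -> K-means pipeline.
# Objects and their attribute scores are made up for illustration.
import random

desk_objects = {
    "bank statement": (0.9, 0.1, 0.0),   # (paperwork, sentimental, electronic)
    "tax form":       (1.0, 0.0, 0.0),
    "crayon drawing": (0.2, 0.9, 0.0),
    "macaroni art":   (0.1, 1.0, 0.0),
    "usb cable":      (0.0, 0.0, 0.9),
    "old phone":      (0.0, 0.2, 1.0),
}

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # New centroid = mean of each cluster (keep the old one if empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(list(desk_objects.values()), k=3)
```

With three well-separated attribute axes, the three clusters fall out along the lines of paperwork, kid art, and electronics.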

Mapping and Reducing the Desk Objects:

I start by mentally chunking out sections of the desk. Next, I process each chunk, mentally score each object in it, and put it into a pile based on its highest-scored feature.  This is where my single-processor humanity was at odds with MapReduce.  If I were a cluster, I would score all the objects in one chunk of Desk Objects while others scored the other chunks; then we’d switch gears, shuffle up our objects, do some clustering on each new chunk, then try to combine our chunks of data.  But I’m not a cluster of computers, so I put all the financial docs in one pile, kid’s art in another, electronics in another, and so on, in one step; then I moved on to the next chunk of desk space.
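The map / shuffle / reduce steps above can be sketched as a few small functions. Again, the chunks, objects, and category lookup are all invented stand-ins; the lookup table plays the role of “score the object and take its highest feature.”

```python
# A minimal sketch of the map -> shuffle -> reduce flow described above.
# Desk chunks, objects, and categories are invented for illustration.
from collections import defaultdict

desk_chunks = [
    ["bank statement", "crayon drawing", "usb cable"],
    ["tax form", "macaroni art", "old phone"],
]

CATEGORY = {  # stand-in for "score each object, keep its top feature"
    "bank statement": "financial", "tax form": "financial",
    "crayon drawing": "kid art",   "macaroni art": "kid art",
    "usb cable": "electronics",    "old phone": "electronics",
}

def map_chunk(chunk):
    # Map: emit (category, object) pairs for one chunk of desk.
    return [(CATEGORY[obj], obj) for obj in chunk]

def shuffle(mapped):
    # Shuffle: group every emitted pair by its key (the category).
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_pile(key, values):
    # Reduce: combine each category's objects into one labeled pile.
    return key, sorted(values)

mapped = [map_chunk(c) for c in desk_chunks]   # the step a cluster would parallelize
piles = dict(reduce_pile(k, v) for k, v in shuffle(mapped).items())
```

The map step over chunks is the part a real cluster would run in parallel; as a single human processor, I had to walk it one chunk at a time.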

In the end it was more of an iterative approach because I couldn’t parallelize the process, but seeing the problem in terms of MapReduce helped me get past the overwhelming boredom that comes with a mundane task.

And pulled from the chaos was an old Seahawks belt, just in time for the Super Bowl.