Big Data Needs Variables

Many concepts from Mathematics, Computer Science, and programming should be leveraged to improve our social/market interactions. One such concept is the variable.

You may know one from algebra as x. x = 3; x + y = 7; y = 4. But now x = 11, so for x + y = 7 to hold, y = -4.

A sensible, successful information society depends on proper segregation and apportionment of data. But you wouldn’t know it based on the governments’ and corporations’ attitudes towards our data.

What do I mean by segregation of data? I mean that certain information is need-to-know. For example, a corporation does not necessarily need to know my physical address, my e-mail address, my phone number, my date-of-birth, et cetera.

Why do they ask for these things, then? Because they have no alternative.

Why should they want an alternative? Look at my examples above and you will notice that except for date-of-birth, they can and will all change from time to time.

What they should want, in lieu of an e-mail address or credit card or other billing data: a variable.

A credit card’s processing information is sort of like a variable. You can pay it off with cash (if the issuing bank is local to you), check, other credit, et cetera. But in practice it has largely become a value in itself: it expires, and knowledge of it is treated as authorization to charge to it.

A variable names a piece of ephemeral data. You can e-mail me at variable@variable.invalid (which might as well be programmatically generated for our purposes), and that can then point to my current e-mail.

A strong variable system can mean that I control the value on an ongoing basis, while depending parties don’t have to worry about me updating their copy of my data when it changes.
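A minimal sketch of that indirection, with a plain dictionary standing in for a real registry service (all addresses and names here are hypothetical):

```python
# The "variable" is a stable alias that a depending party stores;
# the value behind it is mine to update whenever it changes.
registry = {"variable@variable.invalid": "me@hotmail.example"}

def resolve(alias):
    """Look up the current value behind a stored alias."""
    return registry[alias]

# A company stores only the alias, never the underlying address.
stored_by_company = "variable@variable.invalid"

# Later I switch providers; the company's copy needs no update.
registry[stored_by_company] = "me@gmail.example"
```

The company's stored value never changes; only the resolution does.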

Have you ever changed e-mail addresses and had to go to umpteen different online accounts to update it? Maybe not, if you’re young enough to have only ever had one address. But if you’re old enough to have seen your primary e-mail move from, say, hotmail to gmail, and maybe to something else in the future (e.g., an employer-based address or some secure alternative, at least for some uses), you know that pain.

That needless pain harms the corporation just as much. They see some value in knowing how to contact you, but not enough to recognize the real and profound risk they take on by not looking toward a variable-based solution.

Okay, but I mentioned something about apportionment of data. What’s that? The data should have a home, and maybe a vacation place or safe house, but it should not live everywhere. A thousand copies of data that ignore ACID (Atomicity, Consistency, Isolation, Durability), and in this case especially the C as in Consistency, are begging for pain.
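The consistency failure is easy to sketch, assuming two hypothetical services that each keep their own copy of the same fact:

```python
# Two services each cache my e-mail address; nothing keeps them in sync.
service_a = {"email": "me@hotmail.example"}
service_b = {"email": "me@hotmail.example"}

# I remember to update one service but forget the other.
service_a["email"] = "me@gmail.example"

# Consistency is now silently lost.
copies_consistent = service_a["email"] == service_b["email"]
```

With a variable, there would be one authoritative copy and nothing to drift.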

You want data to be properly allocated across the world for security and privacy, too. If you let the data seep all over, that’s a lot of targets to get your information from.

Other benefits include being able to seamlessly transition between services. Today, the next gmail has a harder time making a dent in the market, because everyone has to update every service to point to their shiny new shinymail address.

The benefits of variables currently get ignored by big businesses, because they assume the value of their databases outweighs the costs. But my guess is that if you look at aging databases, like MySpace’s or Hotmail’s, they lost more in missed opportunities than they ever made monetizing those databases.

Near-future-diving the Internet

We all know what the browser and web and Internet experience is like today.  There are some great things already happening.  But there are also some lousy things we have to deal with that we hopefully won’t forever.

One example of the bad comes in the form of passwords and the current sign-up user experience (UX).  Along a similar vein of grief ore are the various CAPTCHA systems employed to ask commenters why they aren’t helping the tortoise on its back, baking in the hot sun, in order to evoke an emotional response.

There are three areas I’ll examine today:

  1. Data extraction and manipulation
  2. Resource management
  3. Discussion defragmentation

Data extraction and manipulation

Let’s say you come across a series of dates, like the list of earthquakes at Wikipedia: List of major earthquakes: Largest earthquakes by magnitude. You ponder quietly to yourself, “I wonder what a really napkinny mean of the time between those is?”

I happened to have that very experience, so I:

  1. Opened the python REPL (Read Eval Print Loop, an interactive session for a dynamic language)
  2. imported the datetime module
  3. Hand-created an array of those dates
  4. Sorted the array
  5. Mapped them to an array (size n-1) containing the deltas
  6. Summed that array and divided by n-1
  7. Asked the result what that was in days (a_timedelta.days)
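Those steps can be sketched in python; the dates here are a hypothetical handful standing in for the real Wikipedia list:

```python
from datetime import date, timedelta

# Hand-created array of dates (step 3); stand-ins for the real list.
quakes = [date(2011, 3, 11), date(1952, 11, 4), date(1964, 3, 27),
          date(2004, 12, 26), date(1960, 5, 22)]

# Sort (step 4), then map consecutive pairs to deltas (step 5).
quakes.sort()
deltas = [later - earlier for earlier, later in zip(quakes, quakes[1:])]

# Sum and divide by n-1 (step 6); ask for days (step 7).
mean_gap = sum(deltas, timedelta()) / len(deltas)
print(mean_gap.days)
```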

But that was a lot of work, and only for a very simple property of the data. It could have been made easier if I hadn’t hand-created the array of dates; that would have meant either copy-pasting the dates straight into python, or pasting them into a file and then reading and converting them into dates.

We can say with some certainty that looking at the average gap between a series of dates is a common operation. Can we say that in the near-future of the Internet we would like that kind of operation to become trivial for anyone to execute?

Resource management

Here the resource stands for the R in URI. You are probably dealing with resources all the time. Depending on your use patterns, you may have a hundred tabs open in your browser right now. You may have tons of bookmarks. But even if not, you still have to manage resources.

You also manage them in the more traditional sense, of how many tabs you can have, and how long you’re willing to look for one. How many bookmarks, versus how many of them you’ll ever actually use.

The question for the near-future is how much smarter we can make the tools we use to manage the resources. Tab Groups (formerly called Tab Candy, and then Panorama) in Firefox lets you create groups of tabs. The search feature of Tab Groups does even more, especially coupled with the Switch to Tab feature of the awesomebar.

Part of the difficulty in the explosion of resources is that of finding things. You might open ten tabs to find the one you want, and the other nine then sit idle, requiring attention to remove.

This difficulty even infects the otherwise superb awesomebar: when you start typing in order to find a resource, it’s often the case that the search items remain higher in the awesomebar results than the resource(s) they led to. That’s plausibly fixable, but does require some kind of recognition of the content of the pages in the history.

That points to a potentially important distinction for the future of the web: a separation between the activity of searching and using resources. Often pages that aren’t directly searches are still part of the search activity rather than the use activity.

Discussion defragmentation

This was prompted by recent discussion on the Mozilla discussion lists regarding how accessible said lists are. Their lists are currently available in three forms: newsgroups, via the Google Groups web interface, and as mailing lists. The lists serve as one of the main discussion formats used for the Mozilla community.

But other discussions of that community (which is much like the rest of the free software and open source communities in this regard) occur scattered among many blogs across the web. Still others occur on IRC (Internet Relay Chat), which may or may not be logged and made available on the web. And still more occurs on bugs in a bug tracker.

So we see fragmentation of discussion. The other half of the story is how discussions migrate away from their original topic, spawning new discussions. In the case of bug trackers, the commentary may consist of a single thread, but most other forums allow for threaded discussions.

Each of these forms is a tree, so some sort of tree unification would be required to defragment these discussions, and to allow proper splitting off of newly developed topics. It’s harder to envision for subtrees that occur off of the web, but it’s conceivable those parts could be imported to various web-based servings in order to include them.

The challenge in building the full trees is knowing where the upstream discussion is (if any) and where the downstream discussions are (if any). But this is standard tree stuff. The downstream discussions can quite easily point upstream, and they can also send something like a trackback upstream so it can collect the downstream locations.

What that might look like (using a pseudodocument for ease of example):

<!DOCTYPE html>
  <messages parent="">
    <message id="290a4ec0-8672-11e1-a4e2-001fbc092072" time="[TIMESTAMP]">
      Hello, world!
    </message>
    <message id="ca0bfdc8-8672-11e1-b2d3-001fbc092072" time="[TIMESTAMP]">
      !dlrow ,olleH
    </message>
  </messages>

<!DOCTYPE html>
  <messages> <!-- no parent, this is a new root -->
    <message id="6878c1e8-866f-11e1-a574-001fbc092072" time="[TIMESTAMP]">
      Uryyb, jbeyq!
    </message>
  </messages>

<!DOCTYPE html>
  <!-- Treat the parent as a root, even if it has parents -->
  <messages parent="" root="true">
    <message id="fdf0eec2-8673-11e1-b336-001fbc092072" time="[TIMESTAMP]">
      ¡pꞁɹoʍ 'oꞁꞁǝɥ
    </message>
  </messages>

One thing this example doesn’t explain is how to make a leaf a root of several new discussions. I’m sure there are other cases like that where things get complicated, like wanting to merge discussions (i.e., making it a graph rather than a tree). One case where that could occur is if someone wants to reply to two messages with a single reply that ties everything up with a nice bow.
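Stitching fragments like these into one tree from their parent pointers is the easy part; a minimal sketch, with hypothetical fragment ids:

```python
# Discussion fragments as (fragment_id, parent_id) pairs;
# parent_id is None for a root.
fragments = [
    ("mailing-list-thread", None),
    ("blog-post-reply", "mailing-list-thread"),
    ("irc-log-excerpt", "mailing-list-thread"),
    ("bug-comment", "blog-post-reply"),
]

def build_tree(frags):
    """Return (roots, children) from (id, parent) pairs."""
    roots, children = [], {}
    for frag_id, parent in frags:
        if parent is None:
            roots.append(frag_id)
        else:
            children.setdefault(parent, []).append(frag_id)
    return roots, children

roots, children = build_tree(fragments)
```

The hard part remains discovering those parent pointers across newsgroups, blogs, IRC, and bug trackers in the first place.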

But it’s worth thinking about, as the current situation is definitely substandard, and even partial defragmentation would be an improvement.

Integrate Everything

Last time I was talking about Reputation Systems.  That plays nicely into my ideas about integrating the world.  This post (obviously I should at some point formalize these ideas, but for now a blog entry will do) was triggered by Cisco’s announcement that they want to give your fracking thermometer an IP address.  I read that on Slashdot.

Without further ado:

Someone, please think of the objects!

Your thermometer should not be directly addressable as what amounts to a top-level object.  It should be a node in your bedroom or kitchen.  It may be directly addressable, but for many tasks it makes more sense to negotiate object operations through an intermediary controller.  This is how many distributed systems work.

I’d really like to see a system where “ad-hoc network-aware object networks” (ANONs) get built.  They would require (in my view) two things:

  1. Controller-enabled units
  2. Announcement-enabled units

Controller units are those that have the ability to be elected or appointed as the directly-addressable node in an area.  These might be things like your computer, your AV receiver, or a dedicated unit.  There would be one in every room, one in your car or in your seat on the bus/train/plane.

Announcement units are any devices that wish to participate in the network.  These are the thermometers, the kerosene lamps, the microwaves, and so on.

This is the basic operation of a distributed system.  An election is held if there are multiple controllers.  If, and only if, there is no controller should minor devices be directly addressable.  In that event a phone could also step in and operate as a controller.
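A toy sketch of that election, assuming each controller-capable unit advertises a numeric priority (the device names and fields are hypothetical):

```python
# Devices in one room; only some are capable of acting as controller.
devices = [
    {"name": "thermometer", "controller_capable": False, "priority": 0},
    {"name": "av-receiver", "controller_capable": True, "priority": 5},
    {"name": "desktop", "controller_capable": True, "priority": 8},
]

def elect_controller(nodes):
    """Pick the highest-priority controller-capable node, or None.

    Returning None models the fallback case: minor devices become
    directly addressable (or a phone steps in as controller)."""
    candidates = [d for d in nodes if d["controller_capable"]]
    return max(candidates, key=lambda d: d["priority"]) if candidates else None

leader = elect_controller(devices)
```

A real system would need tie-breaking and re-election when the leader leaves, but the shape is the same.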

And the reason I mentioned that this is related to the reputation system is that each device will have a reputation, and so will each operator.  This helps manage access control.

Cisco’s idea of giving everything an IP address could be augmented with this sort of system, but ultimately it is less about the direct networking and more about the devices having the capabilities rolled into them to behave as members of a fully integrated network of objects.  That is the starting point Cisco and other companies should be taking, rather than (finally) recognizing that objects should be networked and starting with “okay, let’s give them an IP address.”

Reputation Systems: An Essay


I take it as fact that a programmatic (as opposed to ad hoc) reputation system will be one of the next major undertakings for the net. To me the question is now just a matter of what that system should look like.


The justification for a reputation system is relatively straightforward. Phishing, spam, and other forms of fraud would be greatly diminished. The signal-to-noise ratio would be improved if you could look at a little graphic next to the results on Google and know that people think that site rocks or think that site is lousy. Same thing for news sites like Digg and Slashdot. For that random e-mail: you don’t know if it’s worth looking at, but it’s signed. What do others think of that person? They think that person is spam? Delete.

Existing Systems

Before we proceed with my opinion on that matter, a brief overview of some of the existing systems:

PKI Web of Trust

The PKI “Web of Trust” model is the closest thing to a distributed reputation system that I am aware of. It is directed at validating the keys rather than the reputation of the key’s owner. Having signing as part of a more general system would be useful, though.


eBay

eBay and other similar sites have a reputation system for buyers and sellers. It works okay, but it is not distributed: each node sees one piece of data, the rating, plus the individual feedback that contributed to it. Being centralized means that gaming the system is easier. There is one model of trust, although different people can ascribe different weight to that model.


Slashdot

Slashdot uses a moderation system as well as karma. The more highly you are moderated, the better your karma. It has also introduced a friend/fan/foe/freak system, allowing you to give a single designation (neutral/friend/foe) to another user as well as see their designation of you. They also display friend-of-a-friend and foe-of-a-friend data.

This works nicely within the bounds of the comment system, but you cannot friend/foe the editors, sites, and so on. And it’s exclusive to Slashdot, so you can’t automatically see a Slashdot friend on Digg, for example.


DNS

The Domain Name System also has reputation built into it. The dozen-or-so root nameservers are trusted to provide accurate information, and the TLD nameservers are the same way. You trust that the information is accurate. The same is true of routing tables. The difference is that DNS suffers from gaming in the form of domain squatting.

A Net-wide Reputation System?

That is what I am proposing. It will not supplant the DNS system or the PKI Web of Trust, but such a system would eventually either integrate with or supplant most other general-purpose systems.

Unique IDs

OpenIDs give you a unique identifier to tag with reputation data and build a graph off of. The same goes for websites. E-mail is a little trickier, by the very nature of being able to spoof the From: address. That is, until you augment it with encryption and the e-mails become signed.


My assumption is that we can build a reputation graph and more or less crawl it to discover how much you might want to trust someone or something from the get-go. You could then add your own reputation data for that entity to your graph, which would affect its reputation for those who give you credence.

And of course, if you found the system’s rating wasn’t accurate you would be able to crawl the graph and prune any parts that weren’t giving you accurate results.
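A minimal sketch of such a crawl, assuming a simple model where trust along a path is the product of edge weights in [0, 1] and an entity’s score is its best path; every name and weight here is hypothetical:

```python
# Trust graph: who trusts whom, and how much.
trust = {
    "me": {"alice": 0.9, "bob": 0.6},
    "alice": {"carol": 0.8},
    "bob": {"carol": 0.3, "dave": 0.7},
}

def reputation(graph, source, target):
    """Best multiplicative trust score from source to target (0.0 if unreachable)."""
    best = {source: 1.0}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for neighbor, weight in graph.get(node, {}).items():
            score = best[node] * weight
            # Only revisit a node when we found a strictly better path.
            if score > best.get(neighbor, 0.0):
                best[neighbor] = score
                frontier.append(neighbor)
    return best.get(target, 0.0)
```

Pruning an inaccurate part of the graph then amounts to deleting the offending edges before crawling.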


I believe it is time to start work on a reputation system for the internet. I believe the existing technologies can be integrated with such a system, and that it will benefit the average user of the internet immensely. While I have some more technical ideas of how to do it, I am very interested in hearing the feedback of others. This is particularly true of those working on similar or related problems, such as data portability.

Thank you for taking the time to read this, and I hope it will entice some useful ideas about reputations.