
Big Data on Small Computers

Image: the US motto, e pluribus unum, on the reverse of a US dime.

One of the great emerging fields of computing is the use of big data and machine learning. This is a process whereby large datasets teach computers to do things like translate text, interpret human speech, categorize images, and so on. The problem is that, so far, it requires large amounts of data and a lot of computing power.

The paradigm is largely at odds with the kind of computing people would prefer to do and use. We would rather not send our voice data out to the Internet, or have the Internet always listening to or watching us, in order to get the benefits of machine learning. But while advances in technology will let us crunch the data on smaller devices, it will be difficult to hold the corpus of data needed for training and use.

It remains to be seen whether smaller datasets or synthesized datasets (where a large dataset is somehow compressed or distilled into the important parts) will emerge. So how do we get big data in our relatively small computers?

It is likely that the problem will provoke the emergence of more distributed systems, something many have wanted and waited for. Distributed or collaborative computing lets your computer(s) participate in processing larger datasets. Projects like SETI@home, the distributed-computing arm of the Search for Extra-Terrestrial Intelligence, have used this kind of computing for over a decade.

The main challenge will be finding privacy-protecting ways to break up the data sent to the distributed system. That is, if you send the whole voice capture to the distributed system (as you do, AFAIK, with cloud services like Apple’s Siri), you risk the same privacy issues as with the cloud model.

Instead, it should be possible to break up inputs (video or audio) and send portions (possibly with some redundancy, depending on e.g., if word breaks can be determined locally) to several systems and let them each return only a partial recognition of the whole.
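As a rough sketch of that idea, here is what chunking with overlap and farming the pieces out to separate recognizers might look like. This is Python with made-up stub workers, not any real speech API; in practice each chunk would go to a different remote node, and no single node would ever see the full recording.

```python
# Minimal sketch of the chunk-and-distribute idea described above.
# The "recognizers" here are stand-in stubs; nothing below is a real speech API.

from typing import Callable, List

def split_with_overlap(samples: List[float], chunk_size: int, overlap: int) -> List[List[float]]:
    """Split the input into chunks, repeating `overlap` samples at each boundary
    so that a word cut in half at a chunk edge still appears whole somewhere."""
    step = chunk_size - overlap
    return [samples[i:i + chunk_size] for i in range(0, len(samples), step)]

def recognize_distributed(samples: List[float],
                          workers: List[Callable[[List[float]], str]],
                          chunk_size: int = 16000,
                          overlap: int = 1600) -> str:
    """Hand each chunk to a different worker and stitch the partial results.
    Each worker only ever sees its own slice of the audio."""
    chunks = split_with_overlap(samples, chunk_size, overlap)
    partials = [workers[i % len(workers)](chunk) for i, chunk in enumerate(chunks)]
    # The reassembly step still happens locally, which is where the
    # whole-input context (and the privacy question) lives.
    return " ".join(p for p in partials if p)

if __name__ == "__main__":
    stub = lambda chunk: f"<{len(chunk)} samples recognized>"
    audio = [0.0] * 48000  # three seconds of "audio" at 16 kHz
    print(recognize_distributed(audio, workers=[stub, stub, stub]))
```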

It also remains to be seen whether this piecemeal approach will be as functional as the whole-system approach in all cases. While this splitting undoubtedly takes place inside whole systems like Siri, the reassembly and final processing surely take place over the whole input. That final step may not be easily managed over a distributed system, at least not while protecting privacy.

Consider asking, “what is the time in Rome?”, which might be transcribed slightly off, due to pronunciation, as “what is the dime in Rome?” In a whole-system approach it is likely easier to infer dime → time at some late step than it is when each node hands back a partial result and the final recipient knows less about how each piece was produced. In a question like that, the final text is probably handed to a search engine, which will likely correct the error (though it could take the question literally and answer, “It is the €0.10 coin.”).

For situations where the voice command lends insufficient context for local correction, it could be a greater challenge.

The good news is that it does look like it’s possible for us to have these distributed systems replace proprietary cloud solutions. The questions are when and how they will emerge, and where they might be weaker.


About the Privacy Argument Against Autocars

Image: an overgrown field with the remnants of a car visible (back wheels, steering column). By Ben Salter (Flickr: ben_salter).

One of the arguments against self-driving vehicles is the privacy argument. Won’t you be tracked? Won’t police be able to stop the car? What if the navigation is hacked? And so on.

The problem with this argument is that it ignores the fact that we already have the same problem in many other facets of our lives. The issues are only more obvious and acute when you’re talking about putting your life into the cyberhands of an algorithm.

Society has a real need to confront the security and privacy issues much more directly than it has done. Autocars may raise the issue to higher prominence, which may help us strike a new balance sooner. In that, it could be a feature. But how we ultimately deal with the erosion of barriers to privacy and security is still unsolved.

It will need to be solved even if we stick to manual cars, of course. But it also needs to be solved for televisions that watch you, phones that listen to you (for voice control), and similar services. It needs to be solved when the day comes that your phone tells a restaurant you’re allergic to something. And so on.

There is a balance to be struck between providing information and retaining privacy. And we have yet to strike it in most cases. Our political world is full of dark money, where donors choose not to reveal themselves while attacking others. Our tax code is full of subtle blind alleys where large companies and the very rich hide their money.

What you buy is tracked, which is one of the reasons some companies are shunning NFC-based payments like Apple Pay: Apple Pay would reduce the information they receive when you buy something.

And, of course, online you leave your digital footprints as you jump from reading Eight Exercises that Your Ancestors would Laugh Their Asses Off at You for Doing to ordering food online to reading this blog.

Point is, we’re already being tracked through all manner of invasive tools, both in meatspace and in cyberspace. One more meatspace tracking measure hardly seems more pressing than getting the balance right across all of them, correctly and comprehensively.

Even your goods are tracked as they are shipped to you. And you like that. It lets you know when your stuff will get home.

Done right, instead of waiting on someone running late for a meeting, you could see that they’re stuck waiting for an autocar. Done wrong, you might have a surprise party ruined because the birthday human sees that everyone’s at their house. Or couples might catch each other cheating. Or stalkers and criminals might hack the system and use it for evil ends.

But the good news is that there are real enough non-totalitarian harms to giving up privacy to make strong arguments for laws and technical designs that let us retain privacy, even in autocars. The balance is yet to be struck, but the reasons are there for it. It may not even be a world we find comfortable; it may be less private than we would like. But there’s no indication it will be as bad as the tracking that’s already going on today.


Music: How we listen

I don’t have the statistics, but many players and websites, including iTunes and Last.fm, can track what music you listen to. In theory this data from many users can be aggregated. If that happened, the picture would look something like a bell curve.

The top dominates

Most of the music is from artists the listener likes a lot; this is the right tail of that curve. For example, 50% of songs might come from ten artists, 80% from twenty, and 90% from fifty. The other 10% might come from hundreds.

The same is true for albums: the favorite album by the favorite artist will be even more dominant than the artist was.

Selling as generic

The problem is that the industry treats songs as equal units. You pay roughly the same price for a song you’ve listened to 1,000 times as for one you’ve listened to once, or for a song bought as a gag. But when you actually look at the cost per listen, it becomes apparent this is simply silly.

The songs you love cost you tiny amounts: after the hundredth listen to a $0.99 song, it’s less than a cent per listen! The songs you don’t love cost you more per listen: up to that same $0.99 for listening to a song once.

Shouldn’t the opposite be true?  Wouldn’t you pay more money for the song you love?  Wouldn’t you rather pay less for the song you would delete from your collection except that you never look in that folder anyway?

Progressive pricing

My belief is that music pricing should look like the following model. Note that the numbers are fabricated and that actuaries and statisticians could provide much better figures. This is only a rough model.

For the first ten listens it costs a cent. Period. If you like the song and run through ten listens, you pay a cent. If you decide you don’t like it and give up after the first listen, it still costs a cent. The next ten listens cost a dime: listen 20 times and you’ve paid $0.11. The next 50 listens cost $0.20, so after 70 total listens you would have paid $0.31. And the 100 listens after that cost $0.68, which brings you to the $0.99 original price.
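To make the arithmetic concrete, here is a small Python sketch of the cumulative pricing. The tier sizes and prices are the fabricated numbers from this post, and I’m assuming a tier’s full price is charged as soon as you enter it, as the one-listen example above implies.

```python
# Sketch of the tiered pricing described above. The numbers are this post's
# made-up figures, not real ones.

TIERS = [
    (10, 0.01),   # listens 1-10 cost $0.01 in total
    (10, 0.10),   # listens 11-20 cost another $0.10 (cumulative $0.11)
    (50, 0.20),   # listens 21-70 cost another $0.20 (cumulative $0.31)
    (100, 0.68),  # listens 71-170 cost another $0.68 (cumulative $0.99)
]

def price_for_listens(listens: int) -> float:
    """Cumulative amount paid after a given number of plays of one song.
    Assumes each tier is charged in full as soon as you enter it."""
    paid = 0.0
    remaining = listens
    for tier_size, tier_price in TIERS:
        if remaining <= 0:
            break
        paid += tier_price
        remaining -= tier_size
    return round(paid, 2)

for n in (1, 10, 20, 70, 170):
    print(n, price_for_listens(n))  # 0.01, 0.01, 0.11, 0.31, 0.99
```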

The money distribution is staggered as well. The artist makes less money off of the first tier and more off the successive tiers, while the labels and distributors make more on earlier tiers and less on later ones.

Choices

There would be some other choices with this model. If you knew you’d want the song for at least 170 listens, you could pay an initial fee of $0.89 or so, giving you a discount for buying the song outright. You could also, at any point up to 70 listens, pay the difference between what you’ve paid so far and that $0.89.

Even after paying $0.99 (or $0.89 if you bought it early) you could choose to pay more.  That money would go almost entirely to the artist.

The model’s logic

The consumer value behind this model is twofold. One is to save you money on songs you rarely listen to. The other is to give you the freedom to explore music. The current flat-price model is prohibitive: how many times would you roll the dice at $0.99 per roll?

The model also has powerful incentives for the label, distributor, and artist.  People would explore more music and pay a cent each time, but that would add up quickly.  The current prohibitive model generates less revenue than the new model would for all parties involved.

Other media

This model can easily be extended to movies, television, and text. The tiering would be different, obviously. It would not be as effective for news as for fiction, but that’s a detail that can be overcome by changing the target of the model.

Instead of expecting you to pay $0.01 for each episode of The Daily Show every time you watch it, you would pay $0.01 for the first three episodes you watched. Because that sort of content is unlimited in time (they continue to make new episodes indefinitely), you wouldn’t cut off at $0.99. The probable solution would be to tier over an entire season and fix the top price on a per-season basis.

Advertising

The option to have advertising fits nicely into this general model. Advertisers can choose to pay for a tier for some number of viewers. When you go to view, listen, or read, the choice is yours: accept the advertiser’s offer and, instead of paying, watch, read, or listen to a short advertisement for the duration of that tier.

Conclusion

I believe this sort of model, again with the statistics to back up a more refined pricing and tiering system than I’ve presented, will be a boon to listeners, viewers, and readers.  It will also benefit the content creators and distributors.  I hope to see this model become a standard operating model for content.

Let me know what you think of this model.  What’s wrong with it?  What would make it better?


Modern Fair Use

Jason Beghe, a former Scientologist, spoke out. I’d seen the tease for the above-linked interview and was checking back to see if the full interview had been posted. Then one day the YouTube account of the poster was suspended for prior incidents of alleged copyright infringement.

Currently the trend in media is toward wholesale ownership of content and increasingly comprehensive restrictions on fair use. YouTube has become an exemplar of this trend, with its shoot-first, answer-questions-never (except “with the user who posted the materials”) policy. It has no transparent process for reviewing content-removal decisions. If a video could be construed as infringing without examining the complexities of copyright (including fair use), they’ll pull it.

What’s more, they’ll remove the offender’s account and will, in the future, disable any accounts that person makes if they find out. In short, YouTube is behaving as a barrier to free speech. Obviously it’s the company’s right to do as it pleases with regard to content, but one would expect more from a subsidiary of Google.

YouTube’s strong stance against freedom can be seen in at least two cases I’ve personally followed. The first, covered here a while ago, also involved copyright infringement: Back on YouTube! That case amounted to the infringing use of music to accompany what was otherwise non-infringing content.

This time around the issue is xenutv posting Viacom-owned Stephen Colbert clips about Scientology. Xenutv got removed from YouTube before for this, and apparently YouTube’s policy of banning repeat offenders caused another account removal more recently. Xenutv posted their Jason Beghe interview, and somehow that triggered the account removal.

There are two problems here. The first is a question of fair use. If one so-called channel on YouTube is dedicated to a specific topic, when is it not fair use to post a directly related clip? Arguably posting the Colbert clips is fair use, as it was a small portion of the original work with little, if any, potential to damage Viacom’s ability to exploit the work commercially.

Viacom is acting for one reason: they don’t want their content reused except by direct permission. They don’t want fair use. And YouTube has been complicit in this sort of behavior. The DMCA takedowns do not account for fair use. There is no review process. Effectively, the DMCA, along with the corresponding protocol among the media companies, skirts the notion of fair use altogether.

But, in my view, there is a much larger problem. Even accepting the current behavior of media companies and YouTube in handling infringement, it should be second nature to hold the process up to the light. The situation is the opposite. Content can be pulled from a major site like YouTube at any time for any reason, but it would seem reasonable to expect them to give the public a reason. There should be a public review process that lets the average user see why the company took action.

Wikipedia does this via the article history. You can look at the revisions, and you can look at the talk page. While YouTube is no Wikipedia, they should still strive to be open about why content was taken down. Otherwise we have no reason to believe it wasn’t outright censorship rather than alleged copyright infringement.

So, from all of this, we can set forth a few rules for the modern media:

  1. Be transparent.
  2. Only remove the infringing content.
  3. Offer an alternative source if applicable.

That is, put a process of review in place where users can find out what the reason was for a removal and see that it was handled properly. Don’t cull the forest to remove the weeds. And finally, if Viacom streams the offending episode, maybe YouTube and Viacom could offer the offender the choice of linking to that content instead.

In both of the above cases, the first suspicion of sane-minded individuals is not copyright infringement, but action by those who disagree with the content to get the accounts pulled. That definitely was not the case in the first example. Was it in the second? We don’t know. Obviously something triggered YouTube’s radar. Was it the volume of hits the interview tease received? We will not know unless YouTube opens its process up to public scrutiny.