Pruitt’s Data Rule and Deep Learning

(Soon-to-be former?) EPA head Pruitt has proposed a public data rule (RIN 2080-AA14). It could be a good rule, but that depends heavily on the implementation. This post looks briefly at the rule’s implications for deep learning science.

In short, deep learning takes normalized, record-based data and learns a mapping from each input record to some output determination.

Think of a phone book (the data) with individual listings (the records) and then some determination you want to do on those records. It could be something very simple (last name has n vowels) or something complicated.
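The phone book analogy can be sketched in a few lines. This is a toy illustration, not anything from an actual EPA dataset; the records and the vowel-count rule are made up for the example:

```python
# Records in, one determination out per record. The "phone book" here and
# the names in it are illustrative stand-ins.
VOWELS = set("aeiou")

def determination(record):
    """A simple per-record rule: does the last name have 3 or more vowels?"""
    last_name = record["last_name"].lower()
    return sum(1 for ch in last_name if ch in VOWELS) >= 3

phone_book = [
    {"last_name": "Pruitt", "number": "555-0100"},
    {"last_name": "Ionescu", "number": "555-0101"},
]

results = [determination(r) for r in phone_book]
print(results)  # [False, True] -- "Pruitt" has 2 vowels, "Ionescu" has 4
```

A deep learning system replaces the hand-written rule with a function learned from examples, which is exactly where the opacity discussed below comes from.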

The data itself may be public, but depending on the implementation of the proposed rule, making this secondary data public in any meaningful sense may be very difficult.

There are several challenges. One is simply the number of records that may be used. Another is that the trained network may be proprietary, non-portable, or even dependent on custom hardware. There may also be situations where several neural networks act in tandem, each derived from a bulk of training data (some of which may itself be output from other networks), which would further complicate the data-access requirements.

But there is also the question of whether the output would be public, even if published. Normally data is public when the individual measurements are available and the methodology behind those measurements is known. But there is an inherent opacity to the internal workings of deep learning: explaining the exact function the machine has derived becomes harder as complexity increases, and even if all the inputs and outputs are public, the transition function may remain obscure.

Which isn’t to say that data, methods, and findings should not be replicated, peer reviewed, and subject to introspection. The EPA should, for example, draw a stricter line with fossil fuel companies and other chemical companies, requiring that more of their filings be public.

In the case of deep learning, not for the EPA’s sake, but for the sake of science itself, better rules for how to replicate and make available data and findings are needed.

Others have already pointed out the difficulty of studies predicated on sensitive personal data like medical records. But there is a general need to solve that problem as well, as the inability to examine such information may block important findings from surfacing.

This is similar to the fight over minors buying e-cigarettes online: opponents of e-cigarettes act as though there is a particular, nefarious plot by vendors, but we do not have anything close to a universal age verification system. Better to develop one for all the tasks that require it.

And so it is with the EPA rule: Congress should draft a law that allows all scientific data used by the government to be as public as is possible.


Federal Reporting Should Be Automatic

With the recent gun massacre in a church, it came to light that the attacker should have been barred from purchasing firearms on the open market due to a prior conviction. Now Congress may amend the law to try to strengthen mandatory reporting. But that’s the wrong move here. Why leave open the option for someone to neglect the mandatory when the system could be made automatic?

For this and many other data issues, we still rely on some human to either file a piece of paperwork or otherwise ensure that the relevant notifications are made. That’s wrong. The existence of computerized records means that such notifications and updates should be completely automated. This includes the elimination of the need to acquire certified copies of birth, marriage, and death certificates, along with other routine and necessary data sharing. There should be a widespread effort to let computers do what they’re good at so that humans don’t have to.

With automatic reporting, mistakes will still be made by humans. There needs to be an auditing process and a corrections process. But even there, once corrected, the updates should be automatic.

We can move toward blockchain-backed systems that allow for improved recognition of where errors have occurred and been corrected. But it’s high time that we remove error-prone mandates that pass without action.


Big Data Needs Variables

Many concepts from mathematics, computer science, and programming should be leveraged to improve our social and market interactions. One such concept is the variable.

You may know one from algebra as x. If x = 3 and x + y = 7, then y = 4. But if x later becomes 11, then x + y = 7 means y = -4.
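The same point in code: the name x stays stable while the value it holds changes, and anything derived from x follows along when recomputed. The function name here is my own, chosen for illustration:

```python
# y is derived from x via the constraint x + y = 7, so rebinding x
# and recomputing yields a different y; the name x never changes.
def solve_y(x, total=7):
    return total - x

x = 3
print(solve_y(x))   # 4

x = 11              # rebind the name; everything referring to x follows
print(solve_y(x))   # -4
```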

A sensible, successful information society depends on proper segregation and apportionment of data. But you wouldn’t know it based on the governments’ and corporations’ attitudes towards our data.

What do I mean by segregation of data? I mean that certain information is need-to-know. For example, a corporation does not necessarily need to know my physical address, my e-mail address, my phone number, my date-of-birth, et cetera.

Why do they ask for these things, then? Because they have no alternative.

Why should they want an alternative? Look at my examples above and you will notice that except for date-of-birth, they can and will all change from time to time.

What they should want, in lieu of an e-mail address or credit card or other billing data: a variable.

A credit card’s processing information is sort of like a variable. You can pay it off with cash (if the issuing bank is local to you), check, other credit, et cetera. But in its use, it has largely become a value in itself: it expires, and knowledge of it is treated as authorization to charge to it.

A variable names a piece of ephemeral data. You can e-mail me at variable@variable.invalid (which might as well be programmatically generated for our purposes), and that can then point to my current e-mail.

A strong variable system can mean that I control the value on an ongoing basis, while depending parties don’t have to worry about me updating their copy of my data when it changes.
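A minimal sketch of that indirection, with made-up alias and addresses (the domains used are reserved for examples, and the registry API is hypothetical):

```python
# Depending parties hold only the opaque alias; resolution happens at use
# time. One update to the binding, and every depending party follows.
registry = {}

def bind(alias, current_value):
    registry[alias] = current_value

def resolve(alias):
    return registry[alias]

# A merchant stores only the alias, never the real address.
bind("a1b2c3@variable.invalid", "me@hotmail.example")
merchant_contact = "a1b2c3@variable.invalid"
print(resolve(merchant_contact))  # me@hotmail.example

# I move to a new provider; I update the binding once.
bind("a1b2c3@variable.invalid", "me@gmail.example")
print(resolve(merchant_contact))  # me@gmail.example
```

A real system would need authentication and access control around bind and resolve, but the shape of the idea is just this one level of indirection.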

Have you ever changed e-mail addresses and had to go to umpteen different online accounts to change it? Maybe not if you’re young enough to have only ever had one account, but if you’re old enough to have seen your primary e-mail change from, say, Hotmail to Gmail, and maybe to something else in the future (e.g., an employer-based mail or some secure alternative, at least for some uses), you know that pain.

That needless pain harms the corporation just as much. They see some value in knowing how to contact you, but not enough to recognize the real and profound risk they place on themselves by not looking toward a variable-based solution.

Okay, but I mentioned something about apportionment of data. What’s that? The data should have a home, and maybe a vacation place or safe house, but it should not live everywhere. A thousand copies of data that do not follow ACID (Atomicity, Consistency, Isolation, Durability), in this case the C-as-in-Consistency, are begging for pain.

You want data to be properly allocated across the world for security and privacy, too. If you let the data seep all over, that’s a lot of targets to get your information from.

Other benefits include being able to seamlessly transition between services. Without variables, the next Gmail has a harder time making a dent in the market, because everyone has to change umpteen services to point to their new shinymail address.

Big businesses currently ignore the benefits of variables because they think the value of their databases outweighs the costs. But my guess is that if you look at aging databases, like MySpace or Hotmail, they lost more in missed opportunities than they ever monetized from those databases.