Scott Pruitt, the (soon-to-be former?) head of the EPA, has proposed a public data rule (RIN 2080-AA14). This could be a good rule, but that depends heavily on the implementation. This post takes a short look at what such a rule would mean for science that relies on deep learning.
Briefly, deep learning takes normalized, record-based data and creates a mapping from input data to some per-record output determination.
Think of a phone book (the data) with individual listings (the records), and some determination you want to make about each record. It could be something very simple (the last name has n vowels) or something complicated.
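To make the phone book picture concrete, here is a minimal sketch of the simple case. The records, field names, and the functions `count_vowels` and `determine` are all made up for illustration, and the rule here is written by hand; a deep learning system would instead derive its own mapping from many example records.

```python
# Hypothetical phone book: the list is the data, each dict is one record.
phone_book = [
    {"last_name": "Okafor", "first_name": "Ada", "number": "555-0101"},
    {"last_name": "Smith", "first_name": "Lee", "number": "555-0102"},
    {"last_name": "Nguyen", "first_name": "Tran", "number": "555-0103"},
]


def count_vowels(name: str) -> int:
    """Count a, e, i, o, u in a name (y is deliberately not counted)."""
    return sum(1 for ch in name.lower() if ch in "aeiou")


def determine(record: dict, n: int) -> bool:
    """The per-record determination: does the last name have exactly n vowels?"""
    return count_vowels(record["last_name"]) == n


# One output per record. This list is the "secondary data" derived from the book.
determinations = [determine(record, 3) for record in phone_book]

for record, result in zip(phone_book, determinations):
    print(record["last_name"], result)
```

Here the whole method fits in two short functions that anyone can read and rerun. With a trained network, the equivalent of `determine` is a large set of learned weights rather than a rule a reader can inspect.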
The data itself may be public, but depending on how the proposed rule is implemented, making the secondary data derived from it (the per-record determinations) public in any meaningful sense may be very difficult.
There are several challenges. One is simply the number of records that may be used. Another is that the trained network may be proprietary, non-portable, or even dependent on custom hardware. There may also be situations where several neural networks act in tandem, each derived from a bulk of training data (some of which may itself be output from other networks), which would further complicate the data access requirements.
But there is also the question of whether the output would be public, even if published. Normally data is public when the individual measurements are available and the methodology behind those measurements is known. But there is a reasonable and inevitable blindness to the internal workings of deep learning. Trying to explain the exact function the machine has derived is increasingly difficult as complexity increases, and even if all the inputs and outputs are public, the transition function may be obscure.
Which isn’t to say that data, methods, and findings should not be replicated, peer reviewed, and subject to introspection. The EPA should, for example, draw a stricter line against carbon fuel companies and other chemical companies, requiring that more of their filings be public.
In the case of deep learning, not for the EPA’s sake, but for the sake of science itself, better rules for how to replicate and make available data and findings are needed.
Others have already pointed out the difficulty of studies predicated on sensitive personal data like medical records. But there is a general need to solve that problem as well, as the inability to examine such information may block important findings from surfacing.
This is similar to the fight over minors buying e-cigarettes online: opponents of e-cigarettes act as though there is a particular, nefarious plot by vendors, but we do not have anything close to a universal age verification system. Better to develop one for all the tasks that require it.
And so it is with the EPA rule: Congress should draft a law that allows all scientific data used by the government to be as public as possible.