Ethics of Data Transformation and Republishing

April 15th, 2008

After the reaction of a few friends to my last post, I wanted to put up a separate post specifically about my current opinions regarding the ethics of taking public data, doing a lot of cleanup, then potentially charging for commercial use or download of the new (transformed) dataset. My planned project working with EPA data may not be the last time I do something related to data transformation, so I’m trying to understand the issues here.

Daniel Raffel pointed me to some pertinent info about WestLaw, one of the best-known providers of information that originates in the public domain. WestLaw takes public legal information, incorporates it into its datastores, then charges legal professionals a fee to use it for legal research. It provides proprietary interfaces, and has also built several features that have apparently become indispensable to the legal profession, including a proprietary key-oriented classification of legal data.

Recently, Carl Malamud, who works for Public.Resource.Org, began a project with the aim of making all the primary sources of that legal data available on the Internet. It’s clear from his letter that he believes that making primary source data publicly available does not compete directly with the services and tools that WestLaw provides. However, it appears that WestLaw’s summary publications, such as the Federal Reporter, may be the only published form in which the primary public domain data is available.

Carl is essentially saying that since these summary documents may constitute information derived from data in the public domain, he will be attempting to extract the public domain data from the documents commercially produced on behalf of the government by WestLaw. I am not an expert in this sort of thing, but it seems to me that a reasonable person would believe that if the original public domain data is ONLY made available in any useful form to a commercial vendor who then transforms it into literature, then the public domain data should not fall under that vendor’s copyright. How you go about extracting that data is more of a fuzzy area, but presumably if it can be shown that an effort was made to respect those bounds, I think a reasonable person would say that reverse engineering is okay. Carl’s letter to WestLaw carries this type of reasoning down its natural path, and even suggests to them that they save everyone some time and just release the entire text of their publications, free to download.

Anyone who spends any time around primary source data knows that not all data is created equal. If your intent is to provide tools and services around data, there is a good amount of time and effort that must go into transforming primary source data into a useful format for some specific purpose. I believe that an appropriate action on West Publishing’s part is to go ahead and publish all historical cases in text format (not the Federal Reporter, etc., itself), and let anyone else who wishes to transform that data into a useful format go ahead with their project. This wouldn’t exactly satisfy Carl’s request, but it would meet the standard suggested earlier that the primary source data be available in at least one form.

It’s important to note that I’m not arguing that the final products (the Federal Reporter, etc.) need to be put into the public domain by WestLaw. By the standard suggested above, if the primary data is available in the public domain, I don’t believe there is a strong ethical basis for compelling a commercial venture to take a risk and release all of its products for free. This would preserve some of the economies of scale of data manipulation, while righting the “wrong” that public domain source data is not available at all to the public. Put more simply, if a competitor wished to create a research product similar to WestLaw’s, the cost of transforming the data into a useful information repository with competitive features would still remain. Any proprietary content inside its publications would remain non-free, and presumably anyone who would buy them for their convenience would still do so. The force of Carl’s threat is that he has a very strong point if the only way to get the original primary data is to extract it from their commercial resource. If the decision makers at WestLaw decide to completely oppose his reverse engineering, I think that position would be very politically difficult to defend, and it could cause the government to step in and make the boundaries between public domain and private very clear. As it’s to WestLaw’s advantage to keep those boundaries murky, a compromise of providing just the case data in a text format seems the best solution for their interests.

Now, as to how this pertains to the situation I’ll be going into with regard to the EPA emissions datasets, all of that data is available via their website as more-or-less large CSV downloads. Each year has a somewhat different format. From what I recall from my earlier work on it, it’s kind of a pain to go through and clean up that source data, and it requires some knowledge about automotive industry emissions standards nationwide. Still, the original information is visible, even if it’s in a format that needs some work.

What I’m in the middle of doing is the standard drill: analyze the datasets, design a reasonably general standard schema to use as a blueprint for importing the data, then go set by set, programming transformations from the yearly data into the database. Then an interface to perform queries can be created, as well as a set of useful services to offer on top of the transformed data. After all available data is imported, a maintainer can write a new transformation each year in order to keep the data current. This activity of transforming data from a public domain resource into a different format is original work, and does take a lot of time and effort. Nearly all researchers deal with this sort of work on a regular basis.
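
To make that drill a little more concrete, here is a minimal Python sketch of the pattern: one target schema, one small transformation function per yearly format, and a loader that writes into a database. Everything specific in it is an assumption made up for illustration, not the real EPA layout: the column names ("Vehicle", "CO2 (g/mi)", "mfr", "co2_g_km"), the two sample years, and the four-column schema are all hypothetical.

```python
import csv
import io
import sqlite3
from typing import Callable, Dict, Tuple

# Hypothetical target schema: (model_year, make, model, co2_g_per_mile)
Row = Tuple[int, str, str, float]

KM_PER_MILE = 1.609344


def transform_2006(record: Dict[str, str]) -> Row:
    # Assumed 2006 layout: make and model share one "Vehicle" column,
    # and CO2 is already reported in grams per mile.
    make, _, model = record["Vehicle"].partition(" ")
    return (2006, make.strip().upper(), model.strip(),
            float(record["CO2 (g/mi)"]))


def transform_2007(record: Dict[str, str]) -> Row:
    # Assumed 2007 layout: separate columns, CO2 in grams per kilometre,
    # so it is converted to grams per mile to match the schema.
    g_per_mile = float(record["co2_g_km"]) * KM_PER_MILE
    return (2007, record["mfr"].strip().upper(), record["model"].strip(),
            g_per_mile)


# One transformation per yearly format; a maintainer adds an entry each year.
TRANSFORMS: Dict[int, Callable[[Dict[str, str]], Row]] = {
    2006: transform_2006,
    2007: transform_2007,
}


def make_database() -> sqlite3.Connection:
    # In-memory database holding the standard schema.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE emissions ("
        "model_year INTEGER, make TEXT, model TEXT, co2_g_per_mile REAL)"
    )
    return conn


def import_year(conn: sqlite3.Connection, year: int, csv_text: str) -> int:
    # Parse one year's CSV, normalize every row, and insert the results.
    transform = TRANSFORMS[year]
    rows = [transform(r) for r in csv.DictReader(io.StringIO(csv_text))]
    conn.executemany("INSERT INTO emissions VALUES (?, ?, ?, ?)", rows)
    return len(rows)


if __name__ == "__main__":
    conn = make_database()
    import_year(conn, 2006, "Vehicle,CO2 (g/mi)\nHonda Civic,300\n")
    import_year(conn, 2007, "mfr,model,co2_g_km\ntoyota,Corolla,100\n")
    for row in conn.execute("SELECT * FROM emissions ORDER BY model_year"):
        print(row)
```

The point of keeping each year in its own function is that the query interface and any services built on top only ever see the standard schema, so a format change in a new year’s download means writing one more small function rather than touching everything else.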

What I’d like to suggest is that if data that is already in the public domain and available on the internet is transformed into a version that is more useful to commercial ventures or professionals, it is perfectly fine to charge a fee for access to tools or regular “dumps” of that transformed dataset. For one, the data’s already available, and it is not the original primary source data that is being offered for sale. The business product would be the combination of transforming the original data into a more useful form, and then offering either the transformed data directly, or simply services and tools on top of that data.

If anyone wanted to do that work, then release the result completely into the public domain, I believe that would be a gracious gesture, but I’m of the current opinion that it’s not ethically or morally necessary. Plenty of goodwill could be achieved by offering scholars, nonprofits, or individuals free access, and anyone who thought the cost was too high could attempt to achieve a lower cost by taking the original public domain sources and doing the work themselves.

That’s my current opinion about all this, but it does seem like there’s a lot of strong opinion out there, maybe not expressed in as long-winded a way as mine. Feel free to use the comments to let me know what you think.

  1. April 15th, 2008 at 18:24 | #1

    | primary source data be available in at least one form.

    That’s really all we’re asking. Artificial barriers to entry are obstacles to innovation as well as obstacles to democracy. The benchmark I like to use is that anybody at the end of a cable modem or in a dorm room ought to be able to download the primary materials, no matter how rough, and do something new and useful, whether for academic glory, business wealth, or as a tool for more effective legal practice.

    (FWIW, there is a growing movement of people working together to put primary law materials on the Internet. Some of them are for-profit, some are non-profit … we’re all working for a common aim.)

  2. April 15th, 2008 at 18:37 | #2

    Thanks for your comment, Carl. I think that’s a very fair benchmark, and totally agree with the sentiment. I hope I was able to emphasize the importance of that enough in my post.

  3. Daniel Raffel
    April 23rd, 2008 at 00:16 | #3

    gordon, check this out: http://freegovinfo.info/node/1798 :sigh: it really sheds some additional light on the situation

  4. April 23rd, 2008 at 00:34 | #4

    Daniel, that’s a really depressing read. In the actual FOIA response letter (http://www.scribd.com/doc/2537243/Answer-to-FOIA-Request-2) that Carl got, I think there’s a small contradiction in that they state that some documents had begun to decay at the time of that contract being signed, whereas they also state that providing physical access to paper copies should be enough to fulfill the obligations to the public.

    As even electronic copies are subject to ‘bit-rot’, I believe it is a fair question to ask whether an electronic, open-standards format for storage for these documents that is not owned by a private company is the only way to ensure permanent public access to this important historical archive.

  5. February 25th, 2009 at 09:17 | #5

    Hm. You were working in a very interesting field there.

    My opinion is that it should be acceptable to charge for the service and for the transformation itself. What should be illegal is a) to restrict access to the original public domain data and b) to implement those “artificial barriers” to innovation, i.e. to not allow others to utilize something you proclaim as a standard in order to further develop the solution or to find a more efficient and cheaper way.

    Interestingly enough, I have worked with publicly and not so publicly available data from NGOs, and I believe there is a high demand for this kind of solution in general.

    You should sell such an interface/transformation solution to the source organization, as I think it is more a know-how problem for those responsible than anything else. They do not know how to efficiently and effectively provide the data in a format that can easily be utilized by their intended audience. Then you can sell the aggregated and enriched data to a much broader range of people and charge them for the added value.

    So you help solve a problem – inefficiency – at the interface between organizations and/or users, and make some money in the process. A classic win-win, as it should be.

    But in terms of technology: is that not something that should be solved by developing web services rather than by templating, interfaces or transformations?
