EC consultation on Open Data - a report.

Posted on 2 July, 2013 at 00:00 UTC

Category: open access open data horizon 2020 EU science publishing eLife

This is a report on today’s consultation on open data that was held by the EC. The notes are long, so I have put my conclusions and general comments at the start.

General comments

There was not much disagreement throughout the day. There were repeated calls for the need to incentivise researchers to engage in data sharing, but not too many concrete proposals on how to do this. It does seem from my perspective that libraries could do an amazing job here, but that will depend on the extent to which these libraries have deep technical expertise. One problem libraries seem to have is bridging the gap between their expertise and the scientist at the bench who just doesn’t know what services they can call on.

I spoke late in the day, but I was the first to mention CC0 explicitly, and the first to call for the explicit adoption of CC0/CC-BY. I was surprised by this.

There was an overwhelming reiteration that primary research data is a public good, and as such the default position is that this data should be “open by default”. This was hugely encouraging. There was plenty of nuanced discussion that there are indeed areas where one would need to have restrictions in place around certain kinds of data, but the majority of people who made this point wanted to start from a default open position, and look for explicit reasons on a case by case basis for why one might not adopt this principle. I think this is a healthy way to proceed.

There were some skirmishes over IP and patents, and an explicit call from representatives of Philips and of the German defence industry that data should not be made open. One even said that they liked public funding, but didn’t like the idea of opening that data (hello, can someone please let this person know what a “public good” means?). Anyway, both representatives were amenable to the idea of embargoes for data that is generated in public-private partnerships, so I think that was healthy. One aside is that this thread of conversation popped up throughout the day, but I feel that it is largely a distraction from the core question, that of the status of open data for primary publicly funded research. What it does show is that in this debate we need to get the lines really clear, so as not to waste cycles discussing edge cases, and so that we don’t end up imposing artificial restrictions for fears that should not really be applicable.

No one mentioned linked data. No one.

The really key issue, in my mind, is how you build a system that captures data in a way that is more robust than the lifetime of the researcher who created it. If we could say with confidence that the data a researcher used is as accessible to future generations as their publications are, then we will have succeeded. We can still get our hands on the finches that Darwin worked on, which is amazing. That we can’t get the Excel file that Joe Postdoc created six years ago is a shame on us.

Notes from the day.

My notes on the actual discussions through the day are pretty sketchy. If anything is unclear, I’m happy to respond in the comments.

Opening notes

The opening remarks set out three reasons why the Commission believes that opening access to research data is a must.

good for science
good for SMEs - they have evidence for this already
good for the citizen

There will be room until the 15th of July to send written contributions to the consultation. Today is not a workshop, it is a hearing, and the main purpose for the day is to hear opinions from the stakeholders.

The day starts with:

The research perspective.

Jildau Bouwman, TNO Department of Microbiology and Systems Biology

negative data.
small data from home experiments.
information in the methods section, including the metadata; the paper should be reusable.

Limits should be on sensitive data and commercial data. Need a specific budget in projects to help put data into the open.

Paola De Castro, Istituto Superiore di Sanità

data management plans should be stressed
mentions the G8 policy
need a way to provide incentives for researchers
they stress the importance of creating a global infrastructure.

Menno Kok, Erasmus Universiteit Rotterdam

Discusses the why of this.

An important factor in the why is how we can get more value from data. This means data enrichment. Mentions the danger of making some data available, specifically genome sequences. Sequences and phenotypes are the things that are dangerous.

How do we stimulate the process? This is to do with fairness. This relates to when data should be made available.

The patent question is going to be an important one. We may have to come to a tailor-made solution that fits all types of research.

How can you get to this kind of solution? Through trial and error.

They would like to propose that the EU incorporates OA under Horizon 2020 as a carefully monitored limited trial.

Salvatore Mele, CERN

How can you get thousands of people to share and to cooperate? By building a community where every contributor is known, and every contribution is acknowledged. He advocates that we can build a global community of sharing.

He mentions the ODIN project. They found a very clear answer: they need to augment the existing infrastructure, and they need one that is technical, social and something else.

Mentions key pieces of the infrastructure.

ORCID
DataCite

We need to accelerate the adoption of these approaches.

No researcher should be left behind. Researchers without access to specific infrastructure should be able to make use of tools such as Zenodo.

Corrette Ploem, Academic Medical Center, University of Amsterdam

Patients, who are the providers of data, may expect that their data is shared as much as possible. From their perspective, it is not in their interest that researchers sit on their data (the patients want to be cured, right?).

Standardisation of data, and encoding techniques, and legislation, on an EU level, is required.

?

Incentives are crucial to build up a culture of data sharing. Research data needs to be considered as a research output on the level of journal articles.

Costs should be included in project funding. Data management and data sharing plans should be required. Such an approach was proposed by the US government.

Initiatives such as the Research Data Alliance are helpful.

Andrew Smith, EMBL-EBI / ELIXIR

We need to look at the term open data. That is a bit misleading: in life science, clearly not all data will be made open. When we talk about open data, we really are talking about accessible data.

Mentions EU-PMC and EBI infrastructure. Where we can, we should look to build on existing databases.

When we use the term data storage, we need to be careful. The costs of running these repositories often sit in curation, running courses and developing standards; the cost is not just on the storage side.

Feels that we should use Horizon 2020 for driving change.

Rolf Vermeij, European Consortium of Innovative Universities

Need to be able to find the data through a search engine. Need ways to enable searching that goes beyond Google.

Some areas of science have a long-standing history of data sharing. Chemists do not share data.

People will need to be educated.

Need stronger peer review on the data.

Debate

A lot of data that is used is often coming from public administration. There is not much of a culture of data sharing from public sector information. These sources of information should be considered.

Education is important. We need to convince researchers that their data is worth sharing. We need to educate researchers about what the basic elements of sharing are. Feels that naturally libraries have a role to play in that.

There is a study that says that peer review of data is just too difficult: there is too much of it, and it would just break the system. We need to think of some other way of doing this.

One approach is to just allow users to make comments on the data that they are using. The second approach is to do this through journals: journals should ensure that data is reviewed.

If there is a clear understanding of whose job it is to do what when it comes to review, that’s helpful.

My comments on peer review of data:

aside from reuse, making data available helps to prove that the experiment happened
not the only solution, but tracking reuse is a good indicator that the data is useful
must support publication of negative results, to overcome publication bias towards successful events in the lab.

A comment is made about what data is; archaeology provides great examples of how heterogeneous data is.

If you can really associate who has provided which part of the data, then you are pinning reputation on the quality of the data. If you identify who is putting data out, you do not need a peer review system.

In medical science the citation index is more important for your reputation than the quality of the data that you produce. Therefore some fields need education, and a change to their incentive structures.

Industry / industrial research perspective

Jan van den Biesen, Philips Research

OA to Scientific publications is really not an issue. No interference with the ability to protect IPR.

Open access to research data is another matter. Open access to research data could affect the ability to protect innovations and IPR. Fully OA might destroy more value than it creates.

They think it should be decided case by case. For example, sharing the results of unsuccessful clinical trials can help reduce unnecessary repetition of experiments; however, making data from enabling technologies open could scare away partners.

They support the OA approach proposed by the Obama administration.

Making raw data available to citizens doesn’t really help, this data should be refined into products by industry for citizens.

Helge Pfeiffer, Advisory Council for Aeronautics Research in Europe /

Need to avoid inflation of papers - through salami slicing and fraud.

Thomas Weise, Federation of German Security and Defence Industries

Funding is highly appreciated, however OA cannot be in the interest of industry. 100% ownership of background information has to be guaranteed to industry. Release to 3rd parties has to be agreed by industry.

Debate

Strong debate on the topic of standard position for openness.

Someone makes the case that there is a strong difference between pure research and applied research. They believe that this is one of the things that the Horizon 2020 pilot should investigate.

Mentions that there are parallels between an amendment to the public sector information (PSI) directive and the research cycle within industry. The directive has said that for public-private partnerships, the default will be open, but that openness will happen under embargo. In addition those embargoes can be challenged. The PSI directive might provide a good framework.

Thomas Weise could agree to this idea of an embargo. In the US in defence, secret programs are suddenly published, but this means that they might not be interesting any more.

Q: Does it make sense for Europe to have a policy, in spite of policies in other parts of the world, or do these policies need to be global? Thomas Weise says that there should be an EU policy in order to retain EU competitiveness. Need an EU publication strategy and research data strategy.

It’s important that Europe is showing leadership in open policies. On the other hand you cannot limit open access to within specific regions. So - yes European policies make sense, especially to join forces with other regions that are interested, but you cannot limit access to this research.

Do not think that we should wait to harmonize our policies.

Research funder perspective

Juan Bicarregui, UK Research Councils

Thinks most countries are already producing policies that are already harmonized. G8 agenda is mentioned again.

STFC holds 40 PB of data, which doubles every 15 months, so soon they will hold 80 PB. They also support a bunch of other tools, including the Square Kilometre Array.
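As a rough worked example of what that growth rate implies, here is a small sketch. It assumes the 40 PB figure and the 15-month doubling time hold steadily, which is my assumption and not something the speaker claimed, and `projected_petabytes` is a name made up for illustration.

```python
def projected_petabytes(months_from_now, current_pb=40.0, doubling_months=15.0):
    """Project data holdings, ASSUMING steady exponential growth.

    The defaults (40 PB today, doubling every 15 months) are the figures
    quoted in the talk; the steady-growth model itself is an assumption.
    """
    return current_pb * 2 ** (months_from_now / doubling_months)


if __name__ == "__main__":
    # Print the projection at a few horizons, in whole petabytes.
    for months in (0, 15, 30, 60):
        print(f"{months:>3} months: {projected_petabytes(months):.0f} PB")
```

Under those assumptions the holdings would pass 600 PB within five years, which gives some sense of why the funders keep returning to the question of who pays for storage and curation.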

David Carr, Wellcome Trust

Value of data vs resources required. There can be limits to data sharing via IP. Need to balance the needs between data generators and users.

There are challenges:

enhancing implementation and enforcement of policies
guiding a sustainable culture of data sharing
recognise that different disciplines are at different stages
need to forge partnerships between funders, the research community and other stakeholders

Anne Wetterbom, Swedish Research Council / Science Europe

Funders’ role is to provide the framework for their research environments.

The Swedish government published a bill on research and innovation in October 2012. There is already a Swedish bill for making public information open, and research conducted through universities is considered to be public information. The universities are responsible for archiving data from their scientists, but this puts a burden on universities, with the current data deluge.

They want infrastructures to be cost efficient, and heterogeneous.

They will work with different stakeholders over the coming year.

During 2014 they are going to go to the government with a draft policy.

They would like to have a discussion on funding models.

Debate

A comment to focus on success stories, which can be used to show the value of access to open data.

Should the way that IP is currently working be discussed, particularly around patents?

Perhaps we should look at the property issue in a new way? In medical research, contracts are very one-sided. This is really a problem; we should think about the cash that is being generated through these partnerships.

Public directive indicates that from 2015, any information that ends up in the university library will be considered as public information, including public sector information that has been generated from within the university.

The question of licensing is raised. A proposal is made to clarify copyright and licensing, suggesting that there should be a limited set of patterns, as has happened with Creative Commons.

The decision about licensing should happen at the proposal stage, so that funders will know whether to fund up front.

There is a defence of the patent system, a description of “the deal” for patents. (There is a strong argument that the patent system is broken, as an episode of This American Life argues. The White House seems to agree.)

For open data, public funding is involved, and this does not preclude the protections that companies have. When we are talking about open access to data for public funding, we should not add more protections or additional layers of protection, as these layers already exist - via the patent system.

Information systems / e-infrastructure perspective

Nikos Askitas, Institute for the Study of Labor in Bonn, Germany

(I think this person is an economist)

Sharing is a good thing, but not all researchers should share. It’s like donating: it’s a good thing, but not everyone will do it.

Data is insurance against fact pollution. Data is not cheap.

You have to make sure that the data stays meaningful over time. Research data is potentially any data.

Research data could be defined as data that has been used at least once to answer a research question.

What does making it available mean? Making it available separates research from hearsay. That must be what defines openness in this context.

On limiting, two remarks - open does not mean free, at the same time proprietary does not mean closed. In the context of data, perhaps we could introduce a data tax.

Perhaps the idea of corporations paying for proprietary data could be introduced in the form of data taxes (definitely an economist).

In terms of storage, there could be journals, libraries, individuals. So centrally or distributed? The Library of Alexandria no longer exists. Monks stored the data in a distributed fashion.

Donatella Castelli, Italian National Research Council and OpenAIRE / Yannis Ioannidis

Requirements on data preservation and management should be light at the project submission stage, and become much more rigorous before the awarding of grants.

Openness should be limited to quality data.

Peter Doorn, Data Archiving and Networked Services

Important to address small data, in addition to BIG data. We should not make open data a religion or a dogma, it’s important to be pragmatic.

Researchers should not own the data that they collect with public funding. On limiting openness, protection of privacy is a factor, but it should not be a dogma. Certain public interests should be protected.

It is good to allow an embargo for up to two years, for researchers who want to publish on data.

On reuse, we need certain citation rules for data. These should include at least a persistent identifier. Make data available for peer review.

Data should be stored in trustworthy archives, which should be certified by the EU framework - there are German and ISO standards for data archiving (could be very helpful).

Make data management eligible for funding.

Matthew Dovey, Joint Information Systems Committee / Knowledge Exchange

Half of funding agencies in northern Europe had data management plans, but only half of them had plans to implement these plans.

Makes the point that sometimes it’s cheaper to recreate data than to store it.

Data is often generated from an array of funding, and researchers are often not aware of the funder requirements.

Funders need to fund ongoing support of data.

Again, training is important. Do we concentrate on new researchers, or the existing researchers?

Infrastructure is easy, technology is easy, getting people to use the infrastructure is harder. (Often, the social context is harder than the technology.)

Any technology must fit with existing workflows and not impose new workflows.

Adam Farquhar, DataCite

DataCite now have 1.7M DOIs. +3M resolutions in 2013, 200 data centres. 275k DOIs in 2013.

Founded in 2009.

Data identification has now matured away from local country standards. The point is there is no need to re-invent the wheel. Identification and citation level meta data are critical for incentive systems.

Data citation requires interoperable APIs and meta data (e.g. content negotiation with CrossRef).
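To make the content negotiation point concrete, here is a minimal sketch of how a client might ask the DOI resolver for citation metadata instead of a landing page. This is my illustration, not something shown in the talk. The function names are made up, `10.5555/example` is a placeholder and not a real DOI, and I am assuming the CSL JSON media type is supported for the DOI in question, which depends on its registration agency.

```python
import json
import urllib.request

# Media type for citation metadata in CSL JSON form (assumed to be
# supported by the registration agency behind the DOI being resolved).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_citation_request(doi, media_type=CSL_JSON):
    """Build (but do not send) a content-negotiation request for a DOI.

    The Accept header asks the resolver for metadata rather than the
    human-readable landing page.
    """
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": media_type},
    )


def fetch_citation_metadata(doi):
    """Send the request and parse the JSON response. Needs network access."""
    with urllib.request.urlopen(build_citation_request(doi), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder DOI for illustration only; substitute a real one before
    # calling fetch_citation_metadata.
    req = build_citation_request("10.5555/example")
    print(req.full_url, req.get_header("Accept"))
```

The design point is that the same identifier serves both people and machines: a browser gets a web page, while a reference manager sending a different Accept header gets structured metadata it can turn into a citation.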

Data identification is more than just assigning a number. You need essential services to support this.

David Giaretta, Alliance for Permanent Access

The common thread is how can we add value? Not just adding value to the creator, but also in other disciplines - commerce, government, the general public.

Most data is unfamiliar to most people. Most people don’t think anything of clicking through 100 different web pages, most people would never do this for data sets - life is just too short.

The key question is who pays, how much, and why? No one makes indefinite commitments.

The solution seems to be to make data usable, by as many people as possible, for as long as possible. CIBER-DS is a project that is trying to do this.

Need to investigate data marketplaces.

Bram Luyten, @mire

Data should be stored in the research institution. Researchers are able to forge large volumes of data. The reputation of the institution is at stake. The academic institution has a horizon that is longer than the span of an individual career, or a single project.

Discussion

Two specific questions: what should the embargo period be, and when do you start counting it? And for a funding agency that has a rejection rate of 90%, would it be reasonable to ask for this up front, or as a first deliverable?

Someone is missing the researcher in how we are arguing things should be. Many people are arguing that their projects should be the one that holds the data. This person (the economist again) does not think we should overburden the researcher. We may end up with shiny open mediocre stuff. Putting all of your data into one big trough makes it easy to put in, but hard to get out. He thinks it is better to have small projects, with community driven curation; the solution needs to be distributed. (Of course this perspective does not address the actual problem that our current data policies address.)

Salvatore says that adding more things to do when writing a proposal is not great, but some thinking about data management can be hugely useful. We could think of a process which encourages people to make sure that the data curation and opening can happen, for example, allowing time at the end of a project with funding, for doing the curation. Set an example, and let people know that it’s OK to take time out from core research to make the data open.

If you want to convince researchers to publish data, you need to make researchers understand what is happening with their data - simple legal templates could help with this.

Someone says something, but I couldn’t follow what they were talking about.

On the question of embargoes, it’s not possible to say that this should start at the end of data collection, as this is not a well-defined point in time. If there is a request for an embargo, this should be laid out in the data management plan. A way that could work is to tie it to the end of the funding period, and tie this to the need to have embargo requests as part of the data management plan negotiation.

In the UK, we are seeing that data management plans are being required up front in the grant application. In terms of support for creating data management plans, there is a role here for libraries to help in this domain.

What you need at proposal time are data management intentions, and what you need during implementation is data management practice.

At proposal time getting a feeling for how much it costs would also be good.

Publisher perspective

Ian Mulvany, eLife / PLOS / PeerJ / Ubiquity Press

See my statement, and slides at my previous blog post.

Fiona Murphy, Wiley

Mentions PREPARDE - this is an ongoing project and set of activities.

Adapting the publication model for data publication is about adapting the existing model.

Jarosław Perzyński, Polish Chamber of Books

Polish publishers do not think about growing, rather about surviving. There are two examples of alarming ideas from Poland. At the end of 2012 the Polish ministry proposed that publishers must transfer electronic rights. In this situation the government wanted to pay 50% of the cost, and have 100% control of the work - this would have been a disaster for the Polish book industry.

Another example is graphene research in 2012. The question is who is able to apply for the commercial use of this research. OA would mean that only rich states would be able to benefit from this work.

He mentions a question about spying and connects this to open access, but I don’t understand what he said in this regard.

It’s mentioned by the chair that opening things is a way to prevent spying.

Eefke Smit, International Association of Scientific, Technical and Medical Publishers (STM)

Publishers welcome this initiative. It is no secret to us how much of a hot topic research data is.

If you want to reuse data, it must be understandable. There needs to be a connection between research data and publication. This addresses some of the fears that researchers have about others misusing their data.

On the culture of sharing - it is again very important that data is integrated with publications.

Anita de Waard, Elsevier, Data Collaborations

Elsevier - we are a large publisher (one of the best lines of the day).

Storing/annotation/curation is not the same thing as sharing.

They do advocate the creation of data catalogues, so that even if data is not shared, you can at least discover what there is, who has it, and what the rights are around it.

Where do you store the data? There are three types of repositories:

generic repositories
domain specific repositories
institutional repositories

They are interested in developing machine-accessible formats for interrogating the data.

How do we enhance data awareness?

Need to look at how the researchers are working now, need to develop tools that can store data at the point of capture - the older self wants to reuse the data created by the younger self.

Allow the researcher insight into why, and the extent to which data was reused.

They would like to suggest the creation of a shared network of best practice in data sharing.

Library perspective

Paul Ayris - LIBER

Libraries should retool to be able to support data management. LERU is compiling a roadmap for the impact of research data, which will be available at the end of 2013. It will include looking at costs - the first question that any vice-chancellor will ask.

LERU believes that there are boundaries, not all data can be made open on day one, but they believe that the default position should be open and not closed.

Thomas Bourke, European University Institute (Florence)

Mirrors what Paul says. There are huge differences between financial economic data and development economic data. The question of scope is a key question. Libraries have established quality control mechanisms around publication, they might be able to provide something like this on the data side.

The source of the data should be captured, is it original data, derived data, modified data?

Would be good if the commission could hear from publishers of primary data - Bloomberg, Thomson Reuters.

Michael Franke, Max Planck Digital Library

What types of data should be open? It is important to find intelligent ways to assess how valuable a data set is, or how appropriate it is to archive, rather than recreating.

Continuous monitoring of data reuse could tell you whether a data set should be kept any longer. It should find out how well a data set performs in terms of reuse.

On data awareness and the culture of sharing: this contains a motivation problem for researchers. There is hardly any individual incentive to share data. One way to overcome this dilemma is a reward system for sharing data. Such a system could go hand in hand with the San Francisco Declaration on Research Assessment (DORA).

(at this point in the evening we are starting to heavily retread over points discussed earlier, so I am cutting back on note taking).

Discussion

On quality - libraries don’t do peer review. They do do some selection, and they ensure that the stuff you get in is the stuff that you will get out later on. Increasingly we are seeing the use of digitised material as research data collections. The library is increasingly becoming the data provider as well as the archiver of the data.

The text and data mining conversation around licensing within the EU, and the breakdown of those discussions, is raised as a point of discussion.