ERC data management workshop, day 2


Well, here we are at day two. My notes on the first day are here. We will open up with a short overview of the breakout sessions yesterday.

Life sciences breakout - key points.

The only point that I hadn’t really covered in my notes from yesterday was the view that scientists should not have to become experts in data management, but that some training would help.

Physical sciences breakout - key points.

  • open access to data shows the true richness of the data
  • can validate the ownership of data
  • can attract collaborators from other fields
  • advantages of data sharing outweigh the disadvantages
  • process of data sharing starts at the level of instrumentation and common data formats
  • there should be the possibility of DOI-type labelling of data packages
  • how do you deal with large unstructured data sets?
  • legal and ethical issues affect the use of such data
  • there is a difference between observational and experimental data sets
  • how does the distance to the commercial market affect acceptance of and practices in data sharing?
  • who makes the first move? – researchers, institutions, funders, societies?
  • do we need a new profession of data curator?
  • appropriately label datasets to support fine-grained attribution
  • develop a culture of acknowledgement
  • provide funding for data sharing
  • embargoes are complex: different embargoes are needed for different levels, PIs need some time to work with the data, and data collected at the national level should be made open immediately
    • an embargo can act as an incentive for researchers to make timely use of the data (they need to get that paper out before their data is released)
  • set aside 15% of each grant for data curation and storage
  • that old chestnut “standardisation vs interoperability”
  • update the EU copyright directive
  • who sets the metadata structures in different communities?

Humanities breakout - key points.

DigiPal was mentioned.

  • management must be done at the discipline level, not at domain level
  • needs to be done above the institutional level
  • sustainability is crucial for SSH
  • could SSH learn how to deal with Ethical issues from the life sciences?
    • need flexible sciences
  • ownership of data is discipline dependent, one rule does not fit all
  • creation of infrastructures is not an ERC mandate (it makes one wonder why we might be here today)
  • need career recognition

### Open discussion on morning presentations.

Data management starts before the first data point is acquired.

Data and publications need to be tied together.

We need to get the right tools to researchers.

Representation of data is as important as data itself.

I remind researchers to cite data in their reference lists.

There is a discussion around whether raw data should be stored or, if it’s possible to derive the data from code, whether that could be sufficient. It seems agreed that each community needs to decide this and find its own norms.

Roles and responsibilities around costs are one of the main issues that universities are currently discussing.

(Today I learnt about the Digital Curation Centre in the UK, I feel a little bad that I’d not totally been on top of that before).

There is a discussion on data journals and data articles. (I’m not entirely sure that this conversation gets us anywhere further than describing the world as we find it).

There is a discussion around funding. It’s asked whether data management and storage for research data represent a new market for the private sector. Strong reservations are expressed by multiple people, and the idea is compared to what has happened with scientific publications.

Breakout session on incentives.

Paul Ayris - Implementing the Future: the LERU roadmap for research data.

  • each university needs a research data management plan
  • researchers should have data management plans
  • LERU recognises that data should be open by default
  • rewards and incentives for researchers need further development

Excitingly the rectors of the universities that comprise the LERU group were very positive about adopting an open data policy.

The point in the roadmap about incentives for researchers has the optimistic view that there will be real economic benefit from opening up data early, and that will lead to the creation of more resources downstream that researchers can later benefit from.

A significant barrier is that data is not part of the way that research evaluation is done. Everything still hinges on the research article.

Not all journals require data to be deposited. Researchers are not going to deposit data out of the goodness of their hearts. There are few rewards for data sharing, let alone concrete rewards and prizes. No LERU universities have any such prize.

The recommendations on how to improve the situation include these common themes:

- cite the data
- enforce data policies
- reward data contributions

Currently a good number of institutions have not developed a good research data policy, or data curation systems or policies. It’s not that it’s not important, it’s just too early in the process. Institutions are currently more involved with looking at open access; open data has just not made it to the top of the pile yet.

Most are planning to do something, they just haven’t started yet.

### Sünje Dallmeier‐Tiessen - Incentives for Open Science Attribution, Recognition, Collaboration.

Questions that come up from researchers:

How do I find data referenced in this paper?

This dataset is great! Has the author shared more?

Why should I bother to share my data? No one will see it anyway.

Sünje is working with DataCite and ORCID on ODIN, a way to link data, papers and people. This kind of infrastructure can help answer many of the questions that people have today about data.

Again, the data citation principles are mentioned.

She gives a great example of how Kyle Cranmer uses his ORCID profile to show how he has contributed to data creation on the ATLAS experiment.

(It looks to me that this question of data citation is now well within the realm of having been technically solved, so we need to move to advocacy, and we need to teach researchers how to do this. The question of “how can I cite data” has a clear answer. Getting people to find out about the answer is the next challenge).
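As a rough illustration of how mechanical the "how can I cite data" step can be once a dataset has a persistent identifier, here is a minimal sketch. It is not taken from the talk: the `Dataset` fields, the `format_data_citation` helper and the example values (including the DOI) are all hypothetical, and the layout is only loosely modelled on the creator, year, title, publisher, identifier pattern that is commonly recommended for data citations.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Minimal metadata needed to cite a dataset (illustrative fields only)."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_data_citation(ds: Dataset) -> str:
    """Build a reference-list entry: Creators (Year): Title. Publisher. DOI link."""
    creators = "; ".join(ds.creators)
    return (
        f"{creators} ({ds.year}): {ds.title}. "
        f"{ds.publisher}. https://doi.org/{ds.doi}"
    )


# All values below are placeholders, not a real dataset or a registered DOI.
example = Dataset(
    creators=["Researcher, A.", "Colleague, B."],
    year=2014,
    title="Example measurements",
    publisher="Example Repository",
    doi="10.1234/example",
)
print(format_data_citation(example))
```

The resulting string can go straight into a reference list alongside article citations, which is the behaviour the advocacy would be trying to encourage.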

Veerle Van den Eynden and Libby Bishop - Incentives for sharing research data, evidence from an EU study.

They looked at case studies from a number of EU countries across a number of different disciplines. There is a diverse range of methods for data sharing. The report will be online next week and the interviews will go into their university repository and will also be available (Open Data FTW!!).

The incentives that these researchers identified were:

- direct benefit
    - collaborations are more robust
    - career visibility
    - get wiser
    - is better for science

- norms
    - default in the research group
    - hierarchical sharing throughout their research career
    - conservative non-sharing cultures represent a challenge
    - openness benefits research, but individual researchers reluctant to take lead

- external drivers
    - funders
    - data support services
    - publishers

These external drivers are not the main drivers, but they do help to shift the landscape.

The big fear remains being scooped. We need to create a level playing field for sharing. Sharing failed experiments was mentioned, in biology and chemistry, as being very important (but people still do not do this).

On data citation, researchers didn’t feel that they had to be able to track reuse of their data, but they did expect to be cited when it was reused.

Micro-publishing and micro-citation were mentioned as important, especially in the life sciences. You need to be able to provide atomic level identifiers.

The report and full recommendations will be available at http://knowledge-exchange.info.

Open discussion after breakout session.

It’s mentioned that there is an error in equating data publication with formal publication. It should be reported as a separate output. It’s also mentioned that in the humanities, when data is cited, the compilers of the data are currently not included in that data citation. (I have to say that I think that the commenter’s full comment is not inconsistent with the idea of actually including names in citations, even if they are not being used right now).

Someone asks for a data repository with an embargo for the period of when a paper is under review. Sünje mentions that Zenodo can support this.

There is a very interesting discussion around aggregation of data versus the original collection of the data. A specific paper is mentioned where there are about 40 authors of an aggregation paper. The data that they aggregated were not in a state to be cited and are not, at this point in time, citable. It’s put to one of the commenters that he could make a comment on the article on the journal platform to ask the authors to correctly cite the original data that they aggregated, and he said that he would be worried about making a comment like that, for fear of a negative impact on his future funding prospects.

I mention that research assessment needs to improve to seriously look at non-article contributions. I mention that researchers may need to look past the impact factor. There is an uncomfortable titter of polite laughter in the room at the recommendation, and we pass quickly over the point.

We do talk about the concrete steps that are out there to reward this kind of behaviour, but there are no institutions that formally recognise and reward these practices. That’s a bit of a red flag.

We ask what kind of reward would make a difference. It’s thought that money would be counter-productive. Research money would be nice. Researchers want help to do their work. They want good services. If they can find people to work with who are professionals in managing data, that would be helpful.

Tim Hunt mentions that the ORCID interface is terrible. Work on that would be very valuable. “If you don’t make a good interface, you might as well not get out of bed”.

We talk about whether more usable software would increase the uptake of good behaviour, but there is no conclusion from the group on this point.

We come back to the issue of what kind of thing the data contribution is. Do we want databases to count as patents or publications? Do we not want them to count as databases? Actually, the point is more about what kind of IP we want for the data, which makes a lot of sense as a question. There is a strong call to make the data open. I have some thoughts on the differences between patents and papers. This also touches on the question of who owns the data.

Reporting session from working groups.

Data management and sharing.

- Issue: need coordination between different data repositories and related services

The key message is that a cultural change is needed when it comes to dealing with data.

Collection of personal data for scientific research is considered legitimate, subject to safeguards, under EU data and privacy policies. They are moving towards a one-stop-shop model for these kinds of data use cases.

It is considered that data protection laws will not require additional resources from institutes (though that’s an opinion that flies in the face of common sense, so it will be interesting to see if it holds up).

Storage, curation and interoperability.

There was a speaker from Data Archiving and Networked Services. It was put that it would be good to

- provide certification for digital repositories  

A lot of technology is working now for managing data, but people don’t know about it, so we need to

- improve advocacy around existing solutions

Key points from this discussion were

- can you trust the data in a repository?

To get to that we need to understand the appropriate level of curation for the data. Metadata is critical. Scientific quality is the responsibility of both the researcher and the institute.

On fraud: who is responsible for it? If it’s found, who owns it?

How do you create a level playing field? It’s mentioned that the UK and the Netherlands are paying for repositories, but that might lead to less open access, as those bodies may decide at some point to no longer make their institutional repositories available to people outside of their institution.

Data discoverability, access and reuse.

- deposit your data into existing structured DBs where they are available

Elixir is mentioned in this talk.

There is a new copyright exception in the UK, but this is limited to non-commercial uses. New copyright exceptions are coming online, but they are not perfectly fit, in their current form, to totally support Big Data reuse.

There is a comment that the work Elsevier has done on the Article of the Future, creating in-article visualisations, involved some discussions around whether these visualisations would be subject to copyright, as they were a derivative work of the original article.

It was mentioned that we need to keep an eye on the emergence of new data types or new technologies. An eye needs to be kept on return on investment.

There is data that shows that an article that has associated data published will get cited more.

If we want open data, then we should also have open access.

When it comes to copyright infringement of machine copying, what should count is not that a copy is made, but the intent behind the copying.

Rewards and incentives for good data management (the carrot session).

I’ve written up this session earlier in this blog post, so I’m going to pass over the summing up of the session.

Breakout session - post summing - discussion.

There is a comment that we need to support the skills for interpreting the data in addition to the skills for creating data. Time for a quick coffee.

That discussion session was fairly low key. I think we have hit maximum overlap on the issues, and we are definitely recycling both issues and proposed solutions. What the concluding discussion will bring we will now discover.

Concluding discussion session.

PLOS mention that they are going to automatically start to collect usage of data, and extend their ALM activity towards data use. They have an NSF grant to look at this. I understand that this program is called “Making Data Count”.

Good data management is good science!

The carrot is a better approach than the stick. We need to listen to what scientists are telling us about how they see this situation, and we need to be responsive to that.

When talking about raw costs for infrastructure, the purchasing power of an institution or a funder is much bigger than that of an individual researcher. This points towards an idea where funders possibly ought to do bulk negotiation, and distribute storage or compute credits to researchers, rather than raw funding. This is the approach that Phil Bourne is discussing with the NIH.

There is a discussion on costs. Storage is mentioned as being perhaps not a significant factor; compute and electricity are also mentioned. (I’ve done an estimate that by 2050 it will cost $1 to store an exabyte of data. However, the truth here is that costs are highly domain specific, and there is a wide distribution of use cases and levels of expertise amongst researchers; raw storage costs are only one aspect of the issue.) I think that a general discussion on this topic is not as helpful as identifying specific issues, or specific solutions.

Enforcement of policy is discussed. The commission says that they want a bottom-up solution, but it is mentioned that a data management plan represents a contractual obligation. (It’s fairly well known that funders are very shy of brandishing sticks: it’s unpopular, and it could lead to unintended consequences. But when it comes to altering behaviour through financial incentive, it’s hard to see options that could be as powerful as penalties for not sharing data as laid out in data management plans, though given the underlying complexity of different research areas I would not want to be the one to pull that trigger).

It’s mentioned that making papers, data and software open will give a benefit to industry and innovation.

We tiptoe over to the topic of open peer review. I’ll just tiptoe away from this topic right now, as it’s fairly off topic for this workshop.

Closing remarks

This has been a harmonious workshop. There is general agreement that we should have open access to research data, and we have many interested parties. We have a long way to go. We also have agreement that we need to change the culture at every level, and that we are possibly not moving fast enough. Being able to hire and obtain technical support has resonated, and has been mentioned several times (I’ll put in another shout out to http://software-carpentry.org).

Where does the data go? Who pays for it? Those are still big questions, and should be developed trans-nationally.

It’s mentioned that we need to identify specific repositories for specific disciplines. I would refine that and say that we have very clear locations for specific kinds of data right now. What we need to identify are the fields that are struggling now, and in particular fields that are at early risk of walking into a data avalanche: fields with no previous good examples of data care, which have gotten into this situation because of new tools that have become available to them, for example microscopy.

Issues and questions that came up today.

- how do you deal with large unstructured data sets?
- legal and ethical issues affect the use of such data
- how does the distance to the commercial market affect acceptance of and practices in data sharing?
- who sets the metadata structures in different communities?
- can we introduce licences that can be interoperable for data?
- who pays, who is responsible for paying?
- Issue: need coordination between different data repositories and related services
- can you trust the data in a repository?

Suggested solutions to issues that came up today.

- give DOIs, or similar, to data
- move towards an internationally level playing field on ethics for research
- create a profession of data curators
- appropriately label datasets to support fine-grained attribution
- develop a culture of acknowledgement
- provide funding for data sharing
- use embargoes as a mechanism to incentivise researchers to make timely use of their own data
- take a percent, say 15%, and set that aside in every grant for data sharing, curation and storage
- update the EU copyright directive
- give a prize for examples of good use of data (it's mentioned that there is a data prize in The Netherlands)
- convince people to copy good data management plans (and follow them)
- cite data in reference lists, use the [FORCE 11 data citation principles](https://www.force11.org/datacitation)
- create an open marketplace of good data management plans
- data management plans should be living documents
- include the data scientist at the point of experimental design
    - (I'm reminded of a story from Janelia Farm ...)
- cite the data
- enforce data policies
- reward data contributions
- create an EU-wide directive on data policy for scientific research
- provide certification for digital repositories
- improve advocacy around existing solutions
- funders should mandate open data
- the EU should take care of infrastructure Europe-wide to promote a level playing field
- create a code of conduct teaching young researchers about the ethical issues around data
- deposit your data into existing structured DBs where they are available
- do bulk purchasing from providers, and distribute compute and storage credits to researchers