Overview of the pilot
Our data pilot will provide a mirror of experimental data from two magnetic confinement nuclear fusion devices (tokamaks) at the Culham Centre for Fusion Energy (CCFE): the Joint European Torus (JET) and the Mega Amp Spherical Tokamak (MAST). The intended users are plasma physics and fusion researchers, engineers and technologists from the 29 members of the EUROfusion consortium and around 100 associated organisations, including those delivering the next-generation nuclear fusion device (ITER) in southern France, namely ITER-IO (France) and Fusion for Energy (Spain).
The scientific & technical challenge
Data from the JET and MAST experiments has been collected over many years (JET has been operating since 1984). It is hosted at CCFE and made available via bespoke APIs and visualisation tools. We would like to make more use of standard data infrastructure, including object-storage platforms and modern APIs. The challenges in making use of a third-party platform include:
- Maintaining the native data versioning and validation status information
- Maintaining the link to local identifiers for data items
- Not losing information from the native hierarchical structure of the data
- Complying with UK government and EU policies on hosting and access restrictions
- Keeping mirrored data in sync as new versions of individual data items supersede old ones
There is scope for EUROfusion members to make more use of each other’s data. We intend to make it simpler to access JET and MAST data remotely.
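The mirroring requirements listed above (version and validation status, local identifiers, native hierarchy, and newer versions superseding older ones) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `MirroredItem` record, its field names, the example identifiers and paths, and the `plan_sync` helper are hypothetical and do not describe the CCFE or EUDAT interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirroredItem:
    """Hypothetical record for one data item held in the mirror."""
    local_id: str    # identifier used by the native (source) system
    path: str        # native hierarchical location, kept verbatim
    version: int     # native version number; higher supersedes lower
    validated: bool  # native validation status flag


def plan_sync(source, mirror):
    """Return source items that are missing from, or newer than, the mirror."""
    mirrored = {item.local_id: item for item in mirror}
    to_copy = []
    for item in source:
        current = mirrored.get(item.local_id)
        if current is None or item.version > current.version:
            to_copy.append(item)
    return to_copy


# Illustrative values only; the identifiers and paths are invented.
source = [
    MirroredItem("sig-001", "device/pulse-1/magnetics/ip", 2, True),
    MirroredItem("sig-002", "device/pulse-1/camera/frame-set", 1, False),
]
mirror = [MirroredItem("sig-001", "device/pulse-1/magnetics/ip", 1, True)]

pending = plan_sync(source, mirror)
print([(item.local_id, item.version) for item in pending])
```

Keeping the local identifier and the hierarchical path as plain fields on each record is one simple way to avoid losing them when the data is flattened into an object store; a real implementation would follow whatever metadata schema the platform supports.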
Data volumes are ever increasing, both in total per experiment and in the size of individual signals such as high-resolution camera data. We need to plan ahead and evolve our data infrastructure to cope with this continued growth. We are also keen to develop and pilot data management approaches for the next-generation nuclear fusion device, ITER, which is currently being constructed in southern France. ITER’s individual experimental runs will last much longer than those of the current generation of tokamaks and will generate up to 0.4 PB of data per day. There is considerable potential for researchers to make more use of HPC facilities, and we aim to provide more convenient ways to make data available for this purpose. We estimate that several hundred users might initially make use of the EUDAT data mirror once it is fully tested and publicised.
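To give a feel for the ITER figure quoted above, the short calculation below converts 0.4 PB per day into an average sustained transfer rate. It is a rough sketch under two assumptions that are not stated in the text: that a petabyte means 10^15 bytes, and that the data is spread evenly over 24 hours. Real acquisition is likely to be bursty, so peak rates could be higher.

```python
PB = 10**15                      # decimal petabyte in bytes (assumed convention)
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_bytes_per_s(petabytes_per_day):
    """Average rate in bytes per second if the volume is spread over a full day."""
    return petabytes_per_day * PB / SECONDS_PER_DAY


rate = sustained_rate_bytes_per_s(0.4)
print(f"{rate / 1e9:.2f} GB/s, {rate * 8 / 1e9:.1f} Gbit/s")
```

Under these assumptions the average works out at roughly 4.6 GB/s, or about 37 Gbit/s.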
Why EUDAT?
The EUDAT platform appeals because the general approach and the services available align well with our own ideas about the future of our data management infrastructure. The Europe-wide nature of our research community is a good fit with EUDAT’s scope.
B2SHARE will be used to provide on-demand access to individual data items via APIs. We will collaborate with EUDAT developers to address some of the challenges around data structure, versioning and access controls. The EUDAT move towards cloud-like interfaces based on SWIFT matches our own current approach. We will help guide the development of a RESTful interface to meet community needs.
B2FIND will be used for data discovery. We aim to provide improved metadata, such as aliases or tags for commonly used signals, to help users who aren’t familiar with the machine-specific signal names.
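The idea of aliases for signal names can be sketched as a simple lookup from a user-friendly search term to a machine-specific name. This is only an illustration of the concept: the alias table, the `EXAMPLE/...` signal names and the `resolve_signal` helper are invented here and are not B2FIND features or real JET or MAST signal names.

```python
# Hypothetical alias table: human-friendly terms mapped to invented signal names.
SIGNAL_ALIASES = {
    "plasma current": "EXAMPLE/IP",
    "ip": "EXAMPLE/IP",
    "electron density": "EXAMPLE/NE",
}


def resolve_signal(term):
    """Return the machine-specific name for a search term, or None if unknown."""
    key = " ".join(term.lower().split())  # ignore case and extra whitespace
    return SIGNAL_ALIASES.get(key)


print(resolve_signal("  Plasma   Current "))
print(resolve_signal("unknown quantity"))
```

Several aliases can point at the same signal, so users from different machines can search with the terms they already know.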
B2SAFE will be used for resilient data storage. This will improve the redundancy of our data management infrastructure and allow bundles of data to be downloaded for particular purposes.
B2STAGE will be used to test shipping data sets between EUDAT storage sites and HPC clusters at Harwell (UK) and CINECA (Italy). This will reduce the need to create and move data bundles manually, a process that is difficult to manage and can break the provenance chain.
Expected outcomes
The EUDAT data pilot provides us with a chance to explore how our data systems can best be integrated with cloud-based data management infrastructure. Because it is separate from our existing systems, there will be more freedom to come up with the best solution without having to address full backward compatibility from day one. The project will enable users from the EUROfusion community to access data from the JET and MAST experiments more conveniently, without complicated remote access arrangements. We intend to increase the number of researchers using the data by making it more easily discoverable and providing access in more convenient ways. The ability to ship datasets to HPC clusters for processing should encourage more use of these facilities and improve the convenience and traceability of the workflow.
Expected domain legacy
This study will be a proof of concept for delivering nuclear fusion data on EUDAT services. If it is successful, we would like to make it the primary route for remote access to our data and continue to improve the metadata and access interfaces. Use of EUDAT could be a means of ensuring the continued availability of the data beyond the lifetime of the current experiments. In the longer term we could gradually increase the scope of the data hosted to include more of the raw data from JET in addition to the more commonly used processed data. We also hope the pilot will grow into a shared repository for data from other nuclear fusion experiments across Europe. This could be a step towards more common tools and interfaces, shared between the various experiments. We are keen to develop and pilot data management approaches for ITER, the next-generation nuclear fusion device, along with our colleagues in other organisations.