I have been part of the "biopharma data management community" for 25 years and during that period I’ve visited many different labs and seen how they perform their data management - from the smallest academic groups to some of the biggest pharma companies in the world.
All this time it feels like we have been talking about the need for "more structured data", "more metadata", "better metadata", etc. so that we can better "integrate different data sources".
The reason why this is still high on the agenda is of course that:
1) Structured metadata is very important
2) We haven't fully solved it yet!
Without it, we all know what happens! Exactly what I have seen in the many labs all over the world over the years:
Data is stored in files - often Excel spreadsheets but also in hard-to-use Word documents, PowerPoint slides and PDFs - and the files are located in sub-folders under the project, the department, the target or whatever.
Or, most painfully, hidden in private folders where others can hardly find them.
But in recent years a consensus seems to have been reached, and it is flowing down from the high towers of the theoretical data manager to the big pharma "data management groups" and now slowly spreading to the labs in the smaller organizations (<100 FTE).
The data need to be FAIR!
What is FAIR data management?
FAIR is firstly an acronym, short for: Findable, Accessible, Interoperable, Reusable.
Secondly, it's a set of guiding principles published in 2016 to ensure good scientific data management. The principles can be found here: https://www.go-fair.org/fair-principles/
If you follow the principles literally it may seem daunting and maybe even a bit academic and - as I have heard from several smaller organizations - the bar seems to be set very high.
But here I recommend that newcomers remember the "if you reach for the stars...." quote. You may not reach the top FAIR bar but ending somewhere on the way there may be good enough.
Getting started with agreed metadata and more structured storage of your data is certainly better than doing nothing. The maturity of the data sets can then be improved over time to the relevant level. This approach is also fully in line with the "Dataset maturity model" defined by the FAIRplus project, which introduces 5 levels of FAIR: https://fairplus.github.io/Data-Maturity/
What value does FAIR add?
The many available FAIR resources not only deliver frameworks, tools, and methods but also a language. A reference point. Hence, instead of saying we need "more structured data storage" (what does that even mean?), we can simply state that our goal is to make our data FAIR. From there we can start taking concrete action.
Many (smaller) organizations may not need to integrate with other external data sources and hence could argue that most of the FAIR stuff is "overkill". But if we rephrase the official interpretation, FAIR should be relevant for at least all research organizations:
Remember that other people in the organization need to be able to FIND and ACCESS your data. They also need to understand your data (= metadata) so that they can analyze it, perhaps INTEROPERATE it with other data, and REUSE it for further analysis in the future.
Seen this way, the FAIR guidelines make sense to most people: following them adds quality and longevity to the company data assets that scientists in the lab spend so much time and money producing.
My experience from using the FAIRplus assessment tells me that a maturity level of 3 is likely "FAIR enough" for most organizations.
Having the data stored in structured form with metadata in a data platform where everybody in the organization can access them will bring a number of advantages:
- Easier and faster for all to find and review the data and use it for their analysis.
- The data can still be found and used when people leave the company.
- There is a foundation for all the new and fancy ML or AI analysis. Without quality data, there is no meaningful ML.
- The data is available and accessible for external due diligence if that becomes a need.
How to get started and implement more FAIRness
The first step is basically to agree to get started and agree to become FAIR!
The point here is that there is work to be done - especially for the data producers. It’s not enough to say "we will be more FAIR".
And if you already have a lot of not-so-FAIR legacy data then there may also be a bit of a clean-up job to do to bring that into a reasonable state.
Secondly, do not set the bar too high from the beginning making it all seem undoable. A culture and mindset change needs to happen across the organization. That stuff takes time!
Then, invest in a relevant scientific data management system where you can store the data with all their metadata in an electronic form accessible from a user interface or other compute systems via APIs.
Finally, agree on a relevant set of metadata/vocabulary terms to start with and ensure they are used correctly across the organization when registering the modalities and uploading the assay metadata and results.
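To make the idea of an agreed vocabulary a bit more concrete, here is a minimal Python sketch of how a record could be checked against the agreed terms before it is uploaded. Everything in it is an assumption made for illustration: the field names (modality, assay_type, unit), the allowed terms and the validate_metadata function are hypothetical and do not come from FAIR, FAIRplus or any particular data management system. Your own organization would agree on its own list.

```python
# Hypothetical controlled vocabulary. The fields and terms are illustrative
# placeholders, not an official FAIR or FAIRplus term list.
VOCABULARY = {
    "modality": {"small molecule", "antibody", "peptide"},
    "assay_type": {"binding", "cell viability", "enzyme inhibition"},
    "unit": {"nM", "uM", "percent"},
}

# Fields every record must carry before it is accepted (an assumption).
REQUIRED_FIELDS = ("modality", "assay_type", "unit")


def validate_metadata(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record passes."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None:
            problems.append(f"missing required field: {name}")
        elif value not in VOCABULARY[name]:
            # Exact match on purpose: "Antibody" and "antibody" are different
            # terms to a search index, so spelling variants are rejected.
            problems.append(f"{name}={value!r} is not an agreed term")
    return problems


if __name__ == "__main__":
    good = {"modality": "antibody", "assay_type": "binding", "unit": "nM"}
    bad = {"modality": "Antibody", "assay_type": "binding"}
    print(validate_metadata(good))
    print(validate_metadata(bad))
```

A check like this can run at upload time, so that spelling variants and missing fields are caught by the person who produced the data rather than by a colleague searching for it later. The exact matching is a deliberate choice in this sketch. A real setup might instead map known synonyms to a preferred term.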
With that, you will have laid some important groundwork for more FAIR data management, and by following the guiding principles of FAIR you will ensure that your data becomes a valuable asset to the entire organization.