Research data is a valuable asset, and ensuring it is findable and accessible to other scientists is, of course, key to maximising its value. However, this alone is not enough. You also need to ensure that it is interoperable and reusable, meaning others can easily understand and use the data. Together, these four qualities (findable, accessible, interoperable, reusable) make up the FAIR principles.
Too often, research organisations fall short in this area, and data is stored with slight variations in naming and metadata. While this may not be a big issue in the short term, since the people involved in the project understand the data, it can quickly become a headache for the data manager trying to maintain a quality data store over time.
When multiple researchers, possibly both internal and external, contribute data, and they all have their own naming conventions, things can quickly get messy. And as the people who understand the data thoroughly leave the organisation over time, that understanding leaves with them, and the data becomes harder and harder to use.
Maintain data quality over time
The solution is to have processes in place that maintain data quality over time, ensuring consistency in naming and metadata. In our Scientific Data Management Platform, grit, we address this with controlled vocabularies, which enable the data manager (the admin user) to define and manage controlled lists of values. These vocabularies ensure consistent, standardised data entry across the platform, facilitating accurate downstream analysis and interoperability.
The vocabularies feature restricts field inputs to a predefined list of valid terms. This is especially useful in data collection workflows such as assay metadata, compound or batch properties, and other structured forms.
When configuring a field that supports data type selection, you can bind it to an existing vocabulary.
Once bound, a dropdown will appear in the user interface, allowing users to select defined vocabulary items. During data import, any value in a column mapped to a vocabulary-bound field will be validated against the vocabulary. If a value is not found, an error will be raised and the import will fail for that row. This ensures that only standardised, validated terms are captured across your datasets.
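The import-time check described above can be sketched in a few lines of Python. This is purely an illustrative sketch of the general technique, not grit's actual implementation or API; the names `ControlledVocabulary` and `validate_rows`, and the example `species` vocabulary, are all hypothetical.

```python
# Illustrative sketch of vocabulary-bound field validation during import.
# Class and function names are hypothetical, not grit's API.

class ControlledVocabulary:
    """A named, admin-managed list of valid terms for one field."""

    def __init__(self, name, terms):
        self.name = name
        self.terms = set(terms)

    def is_valid(self, value):
        return value in self.terms


def validate_rows(rows, field, vocabulary):
    """Check each row's value for `field` against the vocabulary.

    Returns a list of (row_index, invalid_value) errors; an empty
    list means every row passed and the import can proceed.
    """
    errors = []
    for i, row in enumerate(rows):
        value = row.get(field)
        if not vocabulary.is_valid(value):
            errors.append((i, value))
    return errors


# Example: a column mapped to a hypothetical "species" vocabulary.
species = ControlledVocabulary("species", ["Homo sapiens", "Mus musculus"])
rows = [
    {"sample_id": "S1", "species": "Homo sapiens"},
    {"sample_id": "S2", "species": "human"},  # free-text synonym, rejected
]
print(validate_rows(rows, "species", species))  # [(1, 'human')]
```

The key design point is that validation happens before anything is written to the store, so a near-miss like "human" is flagged for correction instead of silently creating a second spelling of the same concept.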
Using controlled vocabularies has several advantages:
- You ensure semantic consistency across datasets
- You can make sure the data aligns with institutional or domain-specific standards (e.g., ontologies or regulatory taxonomies)
- You reduce data entry errors by only allowing specific valid terms
In this way, controlled vocabularies become an efficient tool for building a long-term data store on FAIR principles.