The operation of AI is based on data. For this reason, the properties of data require ever closer ethical consideration. Data is not just static material; working with it is a comprehensive process that includes the following stages (a minimal code sketch follows the list):
setting objectives for the system
identifying datasets relevant to the objective
collecting training datasets: methods and management
analysing the quality of the datasets
cleaning up and curating data for mechanical processing
generating a model and testing it
processing production data
continuous monitoring; updating the model if necessary.
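To make these stages concrete, the sketch below walks through a minimal, hypothetical version of the pipeline in Python with pandas and scikit-learn. The synthetic dataset, column names and quality checks are illustrative assumptions, not a prescription for any particular system.

```python
# Minimal sketch of the data-to-model lifecycle listed above.
# The dataset is synthetic and all checks are illustrative assumptions.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Objective: predict a binary outcome from tabular features.
# Identify and collect a relevant dataset (a synthetic stand-in here).
X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["target"] = y

# Analyse quality and clean up: missing values and duplicates are the most
# basic checks; real datasets usually need far more curation than this.
assert df.isna().sum().sum() == 0, "dataset contains missing values"
df = df.drop_duplicates()

# Generate a model and test it on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"], test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# In production, incoming data would be scored with model.predict(...),
# monitored for drift, and the model retrained when quality degrades.
```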
The "direction" of AI ethics questions has been towards outputs and not inputs; we should focus more how the data is produced and processed.
– Researcher William Isaac, Google DeepMind
Quality requirements are highlighted when data is shared
In a society based on a data economy where data is shared and used by various authorities and even the private sector, it is not enough for each organisation to have internally consistent data procedures.
Different actors may have different ways of storing and updating their datasets, and such structural and semantic differences make sensible and secure shared use of the data difficult.
Data itself does not contain any solutions or meaning; those qualities are not generated until the data is used. Since each use case is unique, the value and significance of data interact with end users’ actions. It is therefore necessary to have communication and feedback channels between the producers, owners and users of data.
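As a small, hypothetical illustration of such structural and semantic differences, the sketch below shows two organisations storing the "same" information with different field names, date formats and units, and maps both to one agreed schema before shared use. All names and values are invented for the example.

```python
# Hypothetical example: two data producers describe the same thing differently,
# so a shared schema must be agreed before the data can be used together.
import pandas as pd

org_a = pd.DataFrame({
    "person_id": [1, 2],
    "birth_date": ["1980-05-01", "1992-11-23"],   # ISO dates
    "income_eur": [34_000, 41_000],               # yearly income in euros
})
org_b = pd.DataFrame({
    "id": [3, 4],
    "dob": ["01.06.1975", "12.03.1988"],          # day.month.year
    "monthly_income": [2_900, 3_100],             # monthly income in euros
})

# Map the second source onto the agreed schema of the first.
harmonised_b = pd.DataFrame({
    "person_id": org_b["id"],
    "birth_date": pd.to_datetime(org_b["dob"], format="%d.%m.%Y").dt.strftime("%Y-%m-%d"),
    "income_eur": org_b["monthly_income"] * 12,
})
combined = pd.concat([org_a, harmonised_b], ignore_index=True)
print(combined)
```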
AI diversifies opportunities but demands more from oversight
AI technologies have enabled two significant changes in the utilisation of data:
data originating from several sources can be analysed simultaneously and crosswise
loosely structured or even completely unstructured data can be analysed and used.
In fact, the question of the quality of data used in AI systems is no simple matter. Traditional quality factors, such as timeliness and internal integrity, are still relevant, but they now have to be assessed across several datasets. Similarly, the integrity, security and compliance of the data have to be assessed in a more multidimensional manner.
A partial solution to strengthening the management of complex and varied data could be the systematic use, classification and indexing of metadata. Metadata helps to keep unstructured materials “visible” and gathers material from different sources into a cohesive semantic space.
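A minimal sketch of what such metadata work might look like in practice is shown below: every piece of material, structured or not, gets a consistent set of descriptive fields that can be classified and searched. The record fields and example entries are assumptions made for illustration.

```python
# Sketch of a simple metadata catalogue that keeps unstructured material
# "visible": every item carries the same descriptive fields regardless of
# its source or format, so material can be indexed and found consistently.
from dataclasses import dataclass, field

@dataclass
class MetadataRecord:
    item_id: str
    source: str              # producing organisation or system
    media_type: str          # e.g. "pdf", "audio", "free_text"
    collected_on: str        # ISO date
    keywords: list[str] = field(default_factory=list)

catalogue = [
    MetadataRecord("doc-001", "agency_x", "pdf", "2023-04-02", ["permits", "housing"]),
    MetadataRecord("call-017", "service_line", "audio", "2023-05-10", ["complaint"]),
]

def find_by_keyword(records: list[MetadataRecord], keyword: str) -> list[MetadataRecord]:
    """Return all items whose metadata mentions the given keyword."""
    return [r for r in records if keyword in r.keywords]

print(find_by_keyword(catalogue, "permits"))
```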
As our society becomes increasingly led by the data economy, the emphasis on work related to data and its nature will change. Producing the training data required by AI systems is not a straightforward process. Responsible and appropriate training data requires
finding, evaluating and cleaning up suitable datasets
identifying and managing potentially harmful biases (a minimal check is sketched after this list)
classifying and annotating the data.
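One of the bias checks mentioned above can be sketched very simply: compare how often the positive label occurs in different groups of a hypothetical protected attribute. The column names and the 0.2 threshold are assumptions for illustration; a gap does not prove unfairness by itself, but it flags the data for closer review.

```python
# Illustrative check for one kind of harmful bias in training data:
# does the positive label rate differ sharply between groups?
import pandas as pd

train = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "label": [1,   1,   0,   0,   0,   1,   0,   0],
})

rates = train.groupby("group")["label"].mean()
print(rates)

# The threshold is an assumption; crossing it only triggers a human review
# of how the data was sampled and annotated, not an automatic conclusion.
if rates.max() - rates.min() > 0.2:
    print("warning: label rates differ across groups; review sampling and annotation")
```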
Once the data has been produced, the AI model has to be fine-tuned and validated to ensure the accuracy, quality and completeness of the data. This may involve a large amount of mathematical processing.
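As one example of what such validation can look like, the sketch below uses cross-validation to get a less optimistic estimate of model accuracy than a single train/test split would give. The dataset and model are illustrative stand-ins, not the specific processing the text refers to.

```python
# Sketch of a validation step: k-fold cross-validation averages accuracy
# over several splits, giving a more robust view than one held-out set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1_000), X, y, cv=5)
print("accuracy per fold:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```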
We recommend getting a concrete understanding of the time and competence required by data work; one way is to do some of it yourself. Citizens and organisations may still imagine that AI works like a charm almost by itself, as long as it has been given a suitable dose of data. Data work remains human work whose complexity and resource requirements must not be underestimated.