The arrival of machine learning (ML) as a mainstream business capability has created a plethora of new cybersecurity risks that are still not fully appreciated, Thomas P. Scanlon, technical program manager at Carnegie Mellon University, told delegates at the ISC2 Security Congress in Nashville, Tennessee this week.

Although Scanlon’s background straddles cybersecurity and, more recently, ML, this work profile is still unusual. In today’s organizations it is far more likely that ML professionals will have a background in data science, while cybersecurity people will have grown up with security systems and networking. This technical and cultural division could slow down understanding of the ways in which ML security is different and distinct. “We’ve got to bridge this gap from both sides,” said Scanlon.

His high-level message was that securing ML will require a deeper understanding of ML-specific risks that can’t be addressed through traditional approaches to cybersecurity or software development. He offered three examples of how ML has come unstuck with negative effects:

  • In 2018, Amazon abandoned an internal recruiting tool, developed to identify job candidates, after it turned out to be biased against women. This issue of ‘distributional shift’ was caused by the skewed training data the system’s models had been fed as the basis of their output.
  • The now infamous 2016 example of Microsoft’s experimental Tay chatbot, designed to learn from its interactions with the public. Bombarded with offensive content, the chatbot quickly adopted the extreme language and attitudes of the material it was being sent.
  • The numerous problems associated with self-driving cars, which Scanlon said were connected to incomplete testing based on optimistic assumptions.

Scanlon drew attention to important differences between traditional cyberattacks and ML attacks. For example, exfiltrating data is not an objective of ML attacks, which are concerned with interfering with or skewing data to alter the predictions the model makes. Similarly, persistence is not an ML attack objective in the way it would be in a conventional cyberattack.

Another important difference is that ML systems are often trained on public data (as in the Microsoft chatbot example), which potentially gives an attacker a way to manipulate the system. Perhaps most fundamental of all, testing ML systems is different from the traditional software testing that’s been around for decades.

Secure MLOps

MLOps is the process of taking an ML model from experimental prototype to a production system used in deployed software. As with traditional software development, it is structured and repeatable. The pipeline comprises software and data, but also the extra layer of the ML models themselves, which creates additional development complexity.

Security issues facing MLOps include data poisoning, model manipulation and ‘black boxing’, whereby an adversary can query the model publicly and work out how it is structured (or infer the data used to train it). Ultimately, adversaries can repurpose a model. For more on model attacks, Scanlon pointed delegates to NIST’s Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations.
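
To make the ‘black boxing’ risk concrete, the sketch below (not from the talk) shows how an adversary with nothing more than query access could train a surrogate model that mimics a deployed one. The `query_victim()` function and the toy models are assumptions standing in for a real public inference API.

```python
# Minimal sketch of black-box model extraction. The attacker never sees the
# victim's weights or training data; it only collects (input, prediction)
# pairs from a public endpoint and fits its own surrogate model on them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for the deployed model an adversary can only query (assumption:
# in practice this would be an HTTP call to a public prediction API).
_victim = LogisticRegression().fit(rng.normal(size=(500, 4)),
                                   rng.integers(0, 2, 500))

def query_victim(x: np.ndarray) -> np.ndarray:
    """Return the victim model's predicted labels for a batch of inputs."""
    return _victim.predict(x)

# 1. Probe the endpoint with synthetic inputs covering the feature space.
probes = rng.uniform(-3, 3, size=(2000, 4))
labels = query_victim(probes)

# 2. Train a surrogate ("stolen") model on the query/response pairs.
surrogate = DecisionTreeClassifier(max_depth=5).fit(probes, labels)

# 3. Measure how closely the surrogate mimics the victim on fresh inputs.
test = rng.uniform(-3, 3, size=(500, 4))
agreement = (surrogate.predict(test) == query_victim(test)).mean()
print(f"Surrogate agrees with victim on {agreement:.0%} of queries")
```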

How can MLOps protect itself?

For effective MLOps defence, Scanlon recommended encrypting data (resisting the tendency to skip this step because ML systems are seen as exceptions to normal rules), using data versioning to keep track of the data used to train a model, and tracking data provenance: MLOps teams must know where the data came from and who had access to it. Finally, attention should be paid to data drift (where the data changes suddenly), which might require the model to be retrained from scratch.
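
As an illustration of the drift point, here is a minimal sketch (not from the talk) that compares each feature’s production distribution against the versioned training data using a two-sample Kolmogorov-Smirnov test; the feature names, data and thresholds are hypothetical.

```python
# Illustrative data-drift check: flag any feature whose recent production
# distribution differs significantly from the data the model was trained on.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
feature_names = ["latency_ms", "request_size", "error_rate"]  # hypothetical

# Reference data captured (and versioned) at training time.
training_data = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))

# Recent production data; the last feature has shifted.
production_data = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
production_data[:, 2] += 1.5

for i, name in enumerate(feature_names):
    stat, p_value = ks_2samp(training_data[:, i], production_data[:, i])
    if p_value < 0.01:
        print(f"Drift suspected in '{name}' (KS={stat:.2f}) -> consider retraining")
    else:
        print(f"'{name}' looks consistent with training data (p={p_value:.2f})")
```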

Scanlon offered some questions anyone involved in an ML project could ask regardless of their technical background: whether public data sources were being used, what sort of data validation would be carried out, whether any synthetic data used had been vetted for restricted or private data that might be revealed by an adversary’s prompt attacks, and whether anomaly detection was being used to detect suspicious data tampering.
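
On the last question, a basic anomaly-detection pass over incoming training data might look something like the following sketch, assuming scikit-learn’s IsolationForest and illustrative data: it simply flags rows that sit far from the bulk of the batch so they can be reviewed before training.

```python
# Hedged sketch of using anomaly detection to spot possibly tampered training
# records before they enter the pipeline; values and thresholds are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)

# Mostly well-behaved training rows plus a handful of injected outliers,
# standing in for a poisoning attempt.
clean_rows = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))
poisoned_rows = rng.normal(loc=6.0, scale=0.5, size=(10, 5))
candidate_batch = np.vstack([clean_rows, poisoned_rows])

detector = IsolationForest(contamination=0.02, random_state=0).fit(candidate_batch)
flags = detector.predict(candidate_batch)  # -1 = anomalous, 1 = normal

suspicious = np.where(flags == -1)[0]
print(f"{len(suspicious)} rows flagged for manual review before training")
```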

Scanlon concluded with a warning not to rely on traditional security to defend ML: “It is foolhardy to rely on the data and model being protected by broader IT protections. Assume your IT protections can be compromised.”