The pandemic has exposed weaknesses in conventional machine learning (ML) algorithms, which have been unable to adapt to the new normal. However, even in periods of relative stability these algorithms are liable to fail, according to a recent report from causaLens, which operates an artificial intelligence platform that financial services companies, including Aviva Investors, CLS Group and hedge funds, use to automatically extract causal insights from financial data.
The problem goes by many names, including “underspecification”, “the multiplicity of good models” and the “Rashomon effect”, after the 1950 Kurosawa film in which four witnesses give incompatible accounts of the same incident. Like the witnesses in Rashomon, many equally good models give incompatible descriptions of the data and make widely varying predictions. Several features of conventional ML pipelines exacerbate underspecification; in particular, conventional ML models are prone to learning spurious correlations, which tend to be fragile under real-world conditions.
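To make this concrete, here is a minimal sketch in Python, using synthetic data and illustrative variable names (not taken from the report): two linear models that fit the same training data almost equally well, yet disagree sharply once a spurious correlation between their features breaks down.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Training data: two strongly correlated features, but only x1
# actually drives the target y.
n = 1_000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # near-duplicate of x1 in training
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)

# Two "equally good" models, each leaning on a different feature.
model_a = LinearRegression().fit(x1.reshape(-1, 1), y)
model_b = LinearRegression().fit(x2.reshape(-1, 1), y)
print(model_a.score(x1.reshape(-1, 1), y))  # roughly 0.997
print(model_b.score(x2.reshape(-1, 1), y))  # roughly 0.995, a near-identical fit

# In production the correlation breaks down: x2 no longer tracks x1.
x1_new = rng.normal(size=5)
x2_new = rng.normal(size=5)  # same cases, decoupled features
print(model_a.predict(x1_new.reshape(-1, 1)))
print(model_b.predict(x2_new.reshape(-1, 1)))  # the two models now disagree sharply
```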
The problem illustrated
Consider an insurer pricing motor insurance premiums. Actuarial datasets are big and growing, as traditional risk proxies are increasingly augmented with new kinds of data: telematics data and data about the vehicle itself, especially any on-board Advanced Driver Assistance Systems (ADAS). Typically, insurers rely on “generalized linear models” (GLMs), simple ML models that relax some of the assumptions of ordinary linear regression, to calculate insurance premiums.
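As a sketch of what such a model looks like in practice, here is a minimal Poisson GLM for claim frequency, a standard actuarial choice; the data and feature names (driver age, vehicle age, mileage, an ADAS indicator) are synthetic illustrations, not taken from the report.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(42)

# Synthetic policy-level features: driver age, vehicle age,
# annual mileage (thousands of miles) and an ADAS indicator.
n = 10_000
driver_age = rng.uniform(18, 80, n)
vehicle_age = rng.uniform(0, 15, n)
mileage_k = rng.uniform(2, 30, n)
has_adas = rng.integers(0, 2, n)
X = np.column_stack([driver_age, vehicle_age, mileage_k, has_adas])

# Simulate claim counts from a log-linear (Poisson) ground truth;
# driver age deliberately has no true effect here.
rate = np.exp(-3.0 + 0.02 * vehicle_age + 0.03 * mileage_k - 0.3 * has_adas)
y = rng.poisson(rate)

# Fit the GLM: a linear predictor through an exponential link.
glm = PoissonRegressor(alpha=1e-4, max_iter=1_000).fit(X, y)
print(glm.coef_)           # interpretable per-feature effects on log claim frequency
print(glm.predict(X[:3]))  # expected claims per policy, the basis for a premium
```

The interpretable coefficient vector is exactly what makes this model class attractive to actuaries and regulators alike.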
GLMs are popular because, although simplistic, they are explainable, which helps satisfy regulatory requirements, and easy for actuaries to work with. Worryingly, however, chance plays a pivotal role in determining whether conventional ML pipelines work in the real world.
[Figure: many models fit the data equally well; some perform well in the real world, and others fail.]

But GLMs are also underspecified, which makes them fragile in production. If five variables are to be selected from a dataset of fifty features, there are roughly two million possible combinations of features to use in the GLM (see the calculation below). Many of these models will be roughly as good as one another in development but will give very different pictures of the underlying risks in real-world settings.
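That figure is just the binomial coefficient: choosing 5 features from 50 gives C(50, 5) = 2,118,760 candidate models, which one line of Python confirms.

```python
import math

# Number of ways to choose 5 features from a pool of 50: C(50, 5)
print(math.comb(50, 5))  # 2118760, i.e. roughly 2.1 million candidate GLMs
```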
A minority of insurers use more powerful ML algorithms, such as neural networks, for risk pricing. Insurers have been slow to adopt these algorithms in part because they are “black boxes” that fail to offer the transparency or fairness characteristics regulators require. More powerful ML models also suffer from a more extreme form of underspecification: there are vast numbers of ways of parametrizing, say, a deep learning model that achieve roughly equal loss in training, and many of these candidate models are likely to pick up on misleading correlations in big actuarial datasets.
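A minimal sketch of that effect, using scikit-learn and synthetic data rather than anything from the report: training the same small network from different random seeds yields near-identical in-sample fit, yet the resulting models disagree markedly once inputs drift outside the training distribution.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)

# A simple synthetic regression task.
X_train = rng.uniform(-1, 1, size=(2_000, 5))
y_train = (np.sin(3 * X_train[:, 0]) + 0.5 * X_train[:, 1]
           + rng.normal(scale=0.05, size=2_000))

# Train the identical architecture from five different random seeds.
models = [
    MLPRegressor(hidden_layer_sizes=(64, 64), random_state=seed,
                 max_iter=2_000).fit(X_train, y_train)
    for seed in range(5)
]

# In development, every seed looks equally good...
for m in models:
    print(round(m.score(X_train, y_train), 4))  # near-identical R^2 across seeds

# ...but on inputs outside the training range, the "equally good"
# models diverge, revealing the underspecification.
X_shifted = rng.uniform(2, 3, size=(5, 5))
preds = np.array([m.predict(X_shifted) for m in models])
print(preds.std(axis=0))  # large spread in predictions across seeds
```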