Faster Learning Control Schemes Can Help To Address Cybersecurity Economically

We have many computer “hosts” in our lives, from personal computers to printers to cell phones to smart refrigerators. On these hosts are so-called “vulnerabilities” such as buggy software and weak passwords. Vulnerabilities can be identified by scans and exploited by malware or human hackers who possess “exploits” that can “see” them.

Fortunately, patches or security updates can often eliminate vulnerabilities. Even if patches do not yet exist, the vulnerabilities can be addressed by removing access or hiding the hosts behind additional firewalls. If patches do exist, they may require testing to make sure that they do not harm host performance, as well as manual actions from system administrators. Yet, with large numbers of hosts and tens of thousands of possible vulnerabilities, which actions are cost justified?

The figure below (a) shows three hosts having different sets of vulnerabilities. In the situation depicted, a vulnerability of the printer is being exploited (the star), causing an incident. In such an incident, an expensive investigation is launched, and any users whose information is found to be breached in the printer’s buffer must be notified. All vulnerabilities are given severity ratings by the Common Vulnerability Scoring System (CVSS). Some vulnerabilities are so important they get names such as “Heartbleed” and logos.

The figure below (b) shows a common policy in which the “no action” Action 1 is used for hosts with only low or medium vulnerabilities (State 1 and State 2, respectively) and the “attempt to patch” Action 2 is used in State 3 or State 4 (high or critical vulnerabilities on the host, respectively). A major problem for policy development is that there is little or no experience applying Action 1 in States 3 or 4 or Action 2 in States 1 or 2.
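A policy like this can be written as a simple lookup table. The sketch below is purely illustrative (the state and action labels are ours, not from the article), but it shows how such a fixed rule maps each host state to an action:

```python
# Hypothetical sketch of a fixed patching policy like Figure 1(b).
# States group hosts by their worst vulnerability severity.
POLICY = {
    1: "no action",         # State 1: low-severity vulnerabilities only
    2: "no action",         # State 2: medium severity
    3: "attempt to patch",  # State 3: high severity
    4: "attempt to patch",  # State 4: critical severity
}

def recommend(state: int) -> str:
    """Return the action the fixed policy prescribes for a host state."""
    return POLICY[state]

print(recommend(3))  # -> attempt to patch
```

Because the rule is fixed, it never generates data about, say, leaving a critical host unpatched, which is exactly the missing-experience problem described above.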


Figure 1: (a) Hosts with vulnerabilities at different severity levels and (b) system states and actions. Image courtesy Theodore Allen.

Costs in cybersecurity include repair costs associated with efforts to patch or remediate vulnerabilities and incident investigation costs. Further, there are potentially direct costs relating to losses of information, such as purchasing insurance for all the people whose information was lost. There could also be espionage costs as competitors gain information. Costs not relating to repair and investigation are difficult to quantify and might be rarely incurred. Therefore, in our work, we focus on direct costs so that our artificial intelligence minimizes only the observable costs in our formulation.


How can an Artificial Intelligence recommend patching or remediation actions when it has little data for some combinations of states and actions? The answer is that one needs a “learning” control system that values experimentation. The so-called “Bayesian Reinforcement Learning” (BRL) methods consider all possible actions in all system states and anticipate not only the system moving to new physical states but also the random acquisition of new knowledge or “belief states”. Over time, the system moves with “intrinsic” or built-in uncertainty, and “parametric” uncertainty is reduced through learning.
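The “belief state” is just a probability assigned to each candidate model of the system, updated by Bayes’ rule as observations arrive. The sketch below is a minimal, generic illustration of that update (the numbers and the three-model setup are invented for the example, not taken from the article):

```python
# Minimal sketch of a Bayesian belief update over candidate models.
# Each "model scenario" assigns a probability (likelihood) to the
# transition we just observed; Bayes' rule reweights our belief.

def update_belief(belief, likelihoods):
    """belief[i]: prior probability that model i is the true one.
    likelihoods[i]: probability model i assigns to the new observation."""
    posterior = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(posterior)
    return [p / total for p in posterior]

belief = [1/3, 1/3, 1/3]  # start equally unsure among three models
# Observe one transition; model 0 explains it best, model 2 worst.
belief = update_belief(belief, [0.8, 0.4, 0.1])
print(belief)  # the best-explaining model gains probability mass
```

Repeating this update over many periods is what gradually eliminates parametric uncertainty while the intrinsic randomness of the system remains.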

The figure below shows an imagined BRL intelligence having several models in mind but planning on its “future self” transitioning to new states and effectively eliminating some of the models. At the beginning, which models will be eliminated is not known, but some (almost surely) will be. Artificial Intelligences that do not consider the benefits of future learning are conservative and not experimental, in that they underestimate the benefits of operating the system.


Image courtesy Theodore Allen.

There are many BRL control schemes. In our work, we focus on methods with finite model scenarios. Benefits include simplicity and an easy way to address learning both from observing state transitions and from observing the rewards or losses incurred. Yet, it is important to consider large numbers of model scenarios so that the benefits of learning are not overestimated, i.e., so that it is not unrealistically easy to eliminate all models except one. One of our innovations is to use so-called “variance-reduction” techniques, including “Latin Hypercubes” (LHC), to spread out the model scenarios so that fewer can safely be used. Also, learning can be sped up by using multiple systems at one time, which is one type of exact multi-task learning, as indicated by the figure below.
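To give a flavor of what spreading out scenarios with a Latin hypercube means, here is a generic textbook-style sketch (the article’s actual scenario construction may differ). Each of the d parameter ranges is split into n equal strata, and exactly one scenario lands in each stratum of each dimension:

```python
import random

def latin_hypercube(n, d, seed=0):
    """Generate n points in d dimensions on [0,1) such that each
    dimension is split into n strata with exactly one point per
    stratum -- the Latin hypercube property."""
    rng = random.Random(seed)
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)  # assign each point a distinct stratum
        columns.append([(s + rng.random()) / n for s in strata])
    # One point per row: combine the d coordinate columns.
    return [tuple(col[i] for col in columns) for i in range(n)]

scenarios = latin_hypercube(5, 2)  # 5 model scenarios over 2 parameters
```

Compared with plain random sampling, this stratification avoids accidental clusters of near-duplicate scenarios, which is what allows a smaller scenario set to represent the parameter uncertainty.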


When Duff (2002) proposed what were perhaps the first control schemes that can perfectly minimize expected future costs while learning, the optimality proof was assumed rather than given. We give the proof. The basic idea is illustrated below. An intelligence with multiple models in a world in which its state is known is equivalent to a single-model intelligence in a larger world with an unknown position. Then, the policy can be developed using Partially Observable Markov Decision Processes (POMDPs), which were first solved optimally in 1973 by Smallwood and Sondik. POMDP methods are important in many types of Artificial Intelligences.
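The “larger world” in this equivalence is simply the cross product of model scenarios and physical states: the hidden part of the POMDP state is which model is true. A tiny illustrative sketch (labels invented for the example):

```python
# Sketch: folding K candidate models and S physical states into one
# POMDP whose hidden state is the pair (model, physical state).
models = ["model_A", "model_B"]          # K = 2 candidate dynamics
states = [1, 2, 3, 4]                    # S = 4 observable host states
hidden_states = [(m, s) for m in models for s in states]

# The physical state is observed exactly; only the model index is
# hidden, so the agent's belief is a distribution over which model
# is true -- exactly the belief state from Bayesian RL.
print(len(hidden_states))  # -> 8
```

Solving this single (larger) POMDP optimally therefore solves the original multi-model learning problem optimally, which is the shape of the equivalence argument.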


Image courtesy Theodore Allen.

In developing learning control systems, one issue is the computational speed of generating the learning control policy. Another issue is the speed at which the developed policy eliminates parametric uncertainty. Policy features that we propose and study include the LHC model scenario selection, reward-based learning, and exact multi-system learning. These features make the control system more difficult to train. A question becomes: how can we measure the benefits in terms of faster learning? To answer this, we propose the Median Estimated Learning Time (MELT). The MELT is the median number of periods until the top model scenario reaches a threshold level of probability. The figure below shows three simulations leading to a MELT of 3. Sometimes, a good control system might have a high MELT because a desirable policy can be developed without resolving which model is accurate. In our article, we illustrate how our control system can likely save millions of dollars in maintenance costs at a major university.
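As a concrete illustration of the metric, here is a minimal sketch of computing a MELT from simulated belief trajectories. The three example runs and the 0.95 threshold are invented for the illustration; the article’s exact definition and numbers may differ:

```python
import statistics

def melt(belief_trajectories, threshold=0.95):
    """Median number of periods until the top model scenario's
    probability first reaches the threshold, across simulations.
    Illustrative sketch, not the article's exact formulation."""
    times = []
    for traj in belief_trajectories:  # traj: top-model prob per period
        t = next((i for i, p in enumerate(traj, start=1) if p >= threshold),
                 len(traj))  # if never reached, fall back to the horizon
        times.append(t)
    return statistics.median(times)

# Three simulated runs: probability of the leading model each period.
runs = [
    [0.50, 0.70, 0.96, 0.97],   # learns by period 3
    [0.40, 0.80, 0.90, 0.96],   # learns by period 4
    [0.60, 0.95, 0.97, 0.98],   # learns by period 2
]
print(melt(runs))  # -> 3
```

Taking the median rather than the mean keeps the metric robust to the occasional run that learns unusually slowly.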


Image courtesy Theodore Allen.

These findings are described in the article entitled Reward-based Monte Carlo-Bayesian reinforcement learning for cyber preventive maintenance, recently published in the journal Computers & Industrial Engineering. This work was conducted by Theodore T. Allen and Enhao Liu from The Ohio State University and Sayak Roychowdhury from the Indian Institute of Technology.