Transcript/Topological and statistical tools for understanding system decompositions

From PKC
Revision as of 02:54, 29 March 2023 by Benkoo2 (talk | contribs) (Created page with "SPEAKER A You. This video explains topological and statistical tools for understanding system decompositions, especially those systems that are built from sensors. This is joint work with a number of people in various roles over the past few years, including two separate DARPA programs DARPA Safe Docs and DARPA Simplex You. All of this talk is based on the idea that systems of systems are best described by diagrams, directed graphs in which the nodes are labeled with sp...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

SPEAKER A

You. This video explains topological and statistical tools for understanding system decompositions, especially those systems that are built from sensors. This is joint work with a number of people in various roles over the past few years, including two separate DARPA programs DARPA Safe Docs and DARPA Simplex You. All of this talk is based on the idea that systems of systems are best described by diagrams, directed graphs in which the nodes are labeled with spaces of possible values and the edges are labeled with functions. Given this insight, and given the insight that systems engineers like to draw diagrams to describe their systems, it is best to use the mathematics which is suited to diagrams, that is, topology, sheaths and categories. Once you have a diagrammatic framework and the correct mathematical tools for working with them, you can obtain performance guarantees, mathematical proofs and testable hypotheses that can be tested with statistics that provide provable statements about the validity of your models and how good your data happened to be. This works because category theory describes the structure commonality between your different kinds of data types, topology describes the multi way relationships and given these together, sheaths are the canonical computable representation of multisource data and models. In fact that's a theorem sheaves are the canonical way to represent multiple systems put together. And what this means is that chiefs are able, because of their unique position, mathematically, they're able to find defects between data and models. They're able to locate misinformation and deception. They're able to identify unwanted feedback loops that are present in belief propagation networks. And they're able to identify inferential vulnerabilities and self inconsistencies between system requirements and the actual reality of the system. This yields a variety of new capabilities for heterogeneous systems and models. They can be mathematized in a way that there is no out of band information. The math captures all of the relevant description of your system and it's worth noting this is not just another universal modeling language, but in fact the canonical correct one. All other universal modeling languages are actually a reflection of the underlying sheath theory. This canonicity in fact goes a bit further and is constructive. You can store the models and the data directly in machine readable formats and in fact the software to do this is already written and tested. This mathematic of your system and its data reveals certain non classical hidden inferences that are available and hypotheses you might make. And without this detailed mathematication, these features cannot be defined. In this talk we will talk about the consistency radius which detects deficiencies in data and allows you to do data fusion and the Dalker probabilities that detect anomalous patterns of observations and behaviors. There is also the possibility to characterize local structure using local homology, although that will not be discussed in this talk. To put a rather concrete model in your mind, let us talk about the situation of identifying and locating a hidden transmitter, a hidden radio transmitter. This is frequently an activity done by amateur radio enthusiasts. And because amateur radio enthusiasts are always tinkering, there are a variety of sensors that are typically deployed on a given Fox hunt. Some of these sensors can find bearings, some of them can find signal strength, some of them can find GPS location, and some of them find a mixture. This is a classic heterogeneous data fusion problem. Given a variety of sensors, can you find where the Fox transmitter has been hidden? Let's consider the case of bearing sensors. A given bearing sensor has the ability to identify which direction the signals are coming from. It does so by breaking an asmutel symmetry. Even though you might have a symmetric antenna, amateur radio operators know that if you place the antenna near your body, you could break symmetry and identify which direction the signal comes from. Of course, you could use a directional antenna like the one shown at the bottom here. Of course, there is the difference between these theoretical antenna patterns and reality. From the mathematics of sheaves, it turns out it doesn't matter. You can use either the simulated version or for higher fidelity, you can use the measurement data. In either case, you end up with bearing as a function of sensor position. And in fact, it's not just a function of sensor position, but also position of the hidden transmitter, which you do not know. But if you were to know it, if you were to know the Fox position and the sensor position, you'd be able to compute the bearing. And of course, you're able to forget the sensor position and know just the Fox position. Or forget the Fox position and just know the sensor position. Indeed, you have a diagram of the kind that I mentioned. Now, this is for one bearing sensor, and it's a rather simplistic diagram. But it's easy to build a more complicated one for two sensors. In fact, if you make the assumption that these two sensors are receiving the same Fox transmitter, you could express that diagrammatically by making the two diagrams share the Fox position. Now, if you label this diagram not just with the spaces of possible values and the functions, but also the actual values that were recorded, those occur on some of the nodes on the top. In particular, the left top and right top nodes get labeled with the sensor's positions and the bearings they've recorded. The other three nodes are unlabeled. You could estimate which labeling gives you full consistency across the diagram. This full consistency in the Sheaf language is called a global section. And in fact, global sections of this very simple Sheaf diagram correspond exactly to where the two bearings have sight lines that intersect at the Fox transmitter. If there's some systematic noise, then they won't intersect at the Fox transmitter, but they'll intersect near. How near? Well, that's governed by how much noise. Now, you might say, of course, I could have done this without all of this fancy diagrammatic language. But I don't have to tell systems engineers that the diagrams can get a lot more complicated than this. The amazing thing is that this process of finding global sections is actually mathematically the same, that is, the same process for finding data fusion. And in fact, the the amount of error that is present between your data and model can still be estimated using something in the sheath theory called consistency radius. And in fact, data fusion generally in any sensor system, this is in fact a theorem is simply the process of minimizing the consistency radius across the entire diagram. So in our situation where we have two bearing sensors on the left, if the bearings are marked by arrows at the sensor positions, as shown, it's pretty clear that the sensor positions and their bearings have sight lines that intersect at the place where there is in fact a minimum of consistency radius. That dark blue circle in the middle. On the other hand, if the bearing sensors don't really correspond to reality, they're in opposite directions because perhaps the sensors are broken, then you will notice that there is no minimum. And so in fact, data fusion is impossible. I will say at this point that it is a theorem that this in fact generalizes to all sensor deployments and there is a software library available for your immediate use should you have some data fusion that you need to do. Now what if you don't have a diagram to start with? Well, you might have a system L diagram that could be cajoled into the form I was just showing you. But if you have absolutely no diagram but you're a systems engineer, very likely you have a design structure matrix. A design structure matrix has subsystems as rows and shared state variables as columns or some other sort of thing like that. And you can construct an appropriate diagram by way of the Dowker complex. The Dowker complex exposes topological features of this design structure matrix and it works like this. Each row specifies a vertex, these are the subsystems and each column specifies a face in the diagram. For instance, here column two specifies a face that includes subsystems A and C. It's an edge, as you might imagine, but higher dimensional faces are also permissible. So here column four identifies a subsystem linkage between subsystems C, D and E, a higher dimensional face. It's a kind of hypergraph as it happens. Now, interestingly enough, if there are places where you could have placed a face but didn't say for A, B and C shown here, this witnesses the presence of a potential design constraint violation. In fact, these are often unwanted feedback loops that appear this way or perhaps they are control feedback loops that you want. In either case, it's good to know that they're there and good to be able to systematically detect them. This can be done topologically. Now, in fact, if you want a diagram of the form we've been discussing, you can have one of these. By enriching the Dalker complex a bit into what's called the Dalker Cocheaf, there is a systematic procedure for constructing these from a design structure matrix, and it's actually quite similar to the formal concept lattice, if you know what that is. So this representation now is actually in a very good situation to do a variety of statistical things as well. The idea is fantastically general and allows statistically robust detection of topological features and categorical data of all kinds. On DARPA Safe Docs, where we've drive tested this methodology, we are attempting to classify parser behaviors elicited by files of different kinds. In this case, instead of having subsystems, we have them as features. We have messages, perhaps messages that are produced, error messages, or perhaps status messages produced when certain files, the columns, are ingested by various parsers. The idea is that you'd like to be able to do some anomaly detection or some unsupervised clustering or perhaps some comparison and curation of datasets. All of these things can be supported and are supported by the software tools we've written that are grounded in the mathematics of the Dalker Koshief. This Dalker Koshief is easily constructed by way of a library that we've written, and the main tool in this is called the Dalker Nest function. There's a demonstration video if you're interested in seeing this. But the general idea is that it collapses your design structure matrix into patterns of features and patterns of observations. The patterns of features are things like regular expressions that built your messages, and the observations are things like the file IDs and what dialect or subcorpus they're from. But in fact, the library is quite general. You don't have to do just files and parsers. You can use whatever you like, including subsystems, shared state variables, and anything else that comes with boolean valued features. In fact, once you have this in hand, you can ask what is the probability that a file from a given dialect exhibits as a behavior, a set of messages? This is easy enough statistically if we assume that messages are independent once we've conditioned on dialect. And unlike many statistical assumptions that are not quite benign, when this assumption is violated, it actually helps you identify dialect boundaries. So if you don't actually know the dialects from the outset, you can use violations of this idea to identify where the dialects actually are. With this in mind, once you've grouped the files by message pattern, by building the docker complex in particular, you can assign a probability that that file is in a given dialect, given the fact that you've seen a certain pattern of messages. And then it's a straightforward matter to threshold this probability in order to classify files. The formula on the screen is nothing other than #####'# rule. It's not quite deep. So, for instance, here is an example in which we've classified a variety of files by way of identifying which parsers have rejected the files. And you can see that there are patterns of features. And those patterns of features yield as the first column a variety of possible probabilities for each of the different dialects. Those are the remaining columns. If you focus your attention, say, on the bug three column or bug three row rather, this identifies files that crashed exactly the bug three parser. It's not important to know what that is, but you can already do some data exploration and realize that about 94% of these files were in the Instigator Dialect and the remaining 6% were in the TMG Dialect. Unlike most other statistical tools that look at features on their own, we're looking at features jointly. So we can discriminate between the situation in which bug three occurred on its own exactly. And when bug three occurred in the concert with other features. So here, if we consider all of the files that crashed bug Three and crashed GDAL info, then in fact we can say with certainty that these files came from the Instigator Dialect. This kind of inferential power can be tested, of course, and actually does very well in testing. It's best if you have some empirical training data so that you can tune all of the various probabilities that you saw in the previous slides. This gives you the best precision in recall. If you don't have substantial training data, you could bootstrap into decent performance by way of using the theoretical independence model that I showed you earlier. Of course. What is the standard of care right now? The standard of care right now is simply to reject files that throw too many errors. These must be bad files, right? As it turns out, wrong. Our method is much better. Now, it's worth noting that there are plenty of other tools that we've built that use this idea that give you a variety of additional power, including the ability to visualize your data in a nice way, the ability to curate file data sets by way of asking the distance between entire data sets. And if you do not know the dialects, you can discover them from the data. If you are interested in these capabilities, feel free to drop me a line or have a look at any of these references, or play with the software. Thank you for your time.