1 Introduction
2 IoT Architecture and Existing Issues
3 Existing Self-healing Solutions
4 Self-healing Approach for AMI IoT Architecture
4.1 Sensing Approach (SenS)
-
SenS in the end-user environment: It covers the battery level of all devices and sensors, the data arriving at the gateway from the nodes, the security level, network stability, the connection to the cloud nodes, the state of the gateway’s processor, and the state of the services.
-
SenS in the middle components: SenS targets the servers that establish the connection between peers in the user environment and the cloud environment. It covers the security level, the services, and the working state (on/off).
-
SenS in the cloud environment: It covers the security level, the services, data retrieval, data insertion into the database, the working state of the database, the network, the storage, and the connection to the cloud nodes.
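As a rough illustration of what one SenS reading in the end-user environment might carry, the sketch below groups the sensed items listed above into a single record. It is not the AMI-Platform implementation: the class name `EndUserReading`, the function `summarize`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EndUserReading:
    """One hypothetical SenS snapshot taken at a gateway."""
    battery_levels: Dict[str, float]  # device id -> battery percentage
    services: Dict[str, bool]         # service name -> True if running
    network_up: bool                  # local network reachable
    cloud_link_up: bool               # connection to the cloud node alive
    cpu_load: float                   # gateway processor load, 0.0 to 1.0


def summarize(reading: EndUserReading) -> dict:
    """Condense a snapshot into the facts a later awareness step could inspect."""
    lowest: Optional[float] = min(reading.battery_levels.values(), default=None)
    return {
        "lowest_battery": lowest,
        "stopped_services": sorted(
            name for name, running in reading.services.items() if not running
        ),
        "connected": reading.network_up and reading.cloud_link_up,
        "cpu_load": reading.cpu_load,
    }
```

Keeping sensing as plain data, with no decisions attached, mirrors the separation between SenS and the later components.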
4.2 Awareness of Issues (AwaR)
-
AwaR in the End-user environment: The IoT infrastructure in the end-user environment is exposed to numerous issues that prevent the data from reaching its destination. To face those challenges, AwaR relies on SenS to monitor the components and crosscheck the gathered information against the knowledge at hand.

Battery level is one of the most prevalent issues in IoT. If the battery level falls below 50%, the battery is considered to be running out soon, which will cause a discontinuity in the service. Consequently, a notification is sent indicating either that the sampling (i.e., the data acquired by the sensors) is too high or that the encoding consumes too much energy.

Services used by the gateway are another kind of issue. A service state is either on or off. An “off” state marks the service as broken, and its impact on the transmission depends on how influential the service is. Listeners are therefore applied to the services that are vital to the system; upon a failure, the system sends a notification stating the disabled state of the affected component.

Network state is another of the many issues affecting the IoT architecture. The network might be cut, or might appear to be working while it is not, which results in a discontinuity of the data (i.e., data stops being transferred from sensors to gateways or from gateways to cloud nodes). In this case, being aware means being able to notify that something is wrong with the connectivity across the AMI-Platform.

On the security level, the firewall is deployed on the end-user side (i.e., on the gateways). Listeners to these firewalls have been developed to check the availability of the rules, changes in those rules, intrusions into the system, communication with other peers, data anonymity, and data leakage. Additionally, human interaction is strictly limited to avoid any change in the operating environment.
-
AwaR in Cloud architecture: The cloud represents the core of the AMI-Platform. It comprises all the technologies and methods that together provide a peer for each environmental node and store the data in the database. Thanks to the accumulated knowledge and the SenS component, the listeners applied to services notify upon failure and point out the exact service or component that is having an issue. Network issues are rare at this point, and battery issues are nearly nonexistent. On the security level, listeners are developed and applied to the firewall to notify at the slightest change in the firewall table and whenever an intrusion occurs. Regarding the security of the data in the database, any unauthorized access attempt is reported as soon as it is made. Since the peer (i.e., the cloud node receiving the data from the gateway in the end-user environment) is virtualized, the system should also notify in a timely manner when the peer system encounters a high-level failure. Another critical point is the storage of the data. The AMI-Platform deploys listeners that notify when the storage used on the cloud servers reaches a certain point, so that actions can be taken before the servers are overloaded. Regarding human interaction, a pipeline has been deployed to notify all the programmers when code is uploaded. The code is then validated to check that it integrates with the overall codebase. In this way, there are fewer co-programming issues and, likewise, fewer failures upon deployment.
-
AwaR in Network: Often called the weak link of the IoT architecture because of its public nature, the network can be subject to many of the issues mentioned in the previous subsection. Listeners have been improved and applied to the firewall table on the security level. Notifications are sent upon any addition, removal, or change in the table, and upon intrusion detection. The system also notifies when the link between the end-user node and its peer in the cloud is missing. Listeners are likewise applied to the services monitored by SenS, making the system aware of their availability.
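The awareness rules described in this subsection can be pictured as simple checks over sensed values. The following is a minimal sketch under stated assumptions, not the platform's listener code: `Notification`, `detect_issues`, and the issue labels are hypothetical names, the 50% battery threshold comes from the text, and the 0.8 storage threshold is an assumed placeholder, since the text only speaks of "a certain point".

```python
from dataclasses import dataclass
from typing import Dict, List

BATTERY_THRESHOLD = 50.0  # percent; threshold named in the text
STORAGE_THRESHOLD = 0.8   # assumed fraction of capacity (not given in the text)


@dataclass(frozen=True)
class Notification:
    component: str  # which device, service, or resource is affected
    issue: str      # hypothetical issue label


def detect_issues(
    battery: Dict[str, float],
    services: Dict[str, bool],
    storage_used: float,
) -> List[Notification]:
    """Crosscheck sensed values against known thresholds and report each issue."""
    notes: List[Notification] = []
    for device, level in sorted(battery.items()):
        if level < BATTERY_THRESHOLD:
            notes.append(Notification(device, "battery_low"))
    for name, running in sorted(services.items()):
        if not running:  # an "off" state marks the service as broken
            notes.append(Notification(name, "service_off"))
    if storage_used >= STORAGE_THRESHOLD:
        notes.append(Notification("cloud_storage", "storage_near_capacity"))
    return notes
```

Each notification names the exact component, which matches the requirement that listeners point out the service or component having the issue.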
4.3 Responsibility and Actions (ReAct)
-
ReAct in the End-user environment: Concerning the battery level, actions are taken dynamically on the sampling ratio or the encoding format of the data. As a result, battery consumption drops and battery life is extended. Monitored services can be restarted upon failure. Regarding the network state, the link can be reestablished after an idle time, and a notification is sent upon resolution. As for security, AMI-lab deployed a set of agents responsible for reconfiguring the firewall table when a change is detected, logging out a user when an intrusion is detected, and reconfiguring the whole gateway environment to ensure that the same level of security is maintained. Other agents are deployed for data detection and transmission and for the communication with the peer in the cloud.
-
ReAct in Cloud architecture: For this part, agents are used in the same way on the security level, acting upon change detection in the firewall table, intrusion detection, and service failure. Regarding storage management, the deployed agents clean up and free space, then send a notification describing what happened and which course of action was taken to deal with the issue.
-
ReAct in Network: Most of the agents deployed here address the security level.
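One way to picture the ReAct step is a dispatch table that maps a detected issue to a corrective action and records the outcome for notification. This is only a sketch: the function names, issue labels, and the `state` dictionary are hypothetical, and the concrete remedies (halving the sampling rate, capping storage at 50%) are arbitrary illustrative choices rather than values from the platform.

```python
from typing import Callable, Dict, List, Tuple


def reduce_sampling(state: dict) -> bool:
    """Lower the sampling rate to save battery (halving is an arbitrary choice)."""
    state["sampling_hz"] = state["sampling_hz"] / 2
    return True


def restart_service(state: dict) -> bool:
    """Stand-in for restarting a failed service."""
    state["service_running"] = True
    return True


def free_storage(state: dict) -> bool:
    """Stand-in for a cleanup agent; the 0.5 target is an assumed value."""
    state["storage_used"] = min(state["storage_used"], 0.5)
    return True


# Hypothetical mapping from issue label to corrective action.
ACTIONS: Dict[str, Callable[[dict], bool]] = {
    "battery_low": reduce_sampling,
    "service_off": restart_service,
    "storage_near_capacity": free_storage,
}


def react(issue: str, state: dict, log: List[Tuple[str, str]]) -> bool:
    """Run the action for an issue and log the outcome for later notification."""
    action = ACTIONS.get(issue)
    if action is None:
        log.append((issue, "unhandled"))
        return False
    succeeded = action(state)
    log.append((issue, "resolved" if succeeded else "failed"))
    return succeeded
```

The log of resolved and failed actions is the kind of record from which success and failure counts could later be derived.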
5 Testbed Implementation
5.1 System Architecture
5.2 Results and Analysis
-
Internet connectivity: It ensures that data is sent properly and not lost, and it keeps track of all downtimes and uptimes.
-
Connectivity between the gateway and the virtual gateway: It ensures the connection between the physical nodes and the virtual nodes in the cloud, providing a reliable data communication channel.
-
Services: They are responsible for collecting the data, modeling the data and pre-processing it before publishing it through MQTT.
-
CPU load: Ensures the gateways are not overloaded.
-
Storage: The Kubernetes system produces a lot of data which, once it reaches a threshold, can hinder the proper working of the system.
-
Data Coverage (DC): The ratio of the Acquired Data (ACD) to the Awaited Data (AWD). This metric verifies how much of the expected data was successfully acquired within a timeframe.
-
Failed Self-Healing Rate (FSHR): The number of failed actions (NAF) over the total number of actions taken (NTA). It reflects how often the system misunderstood an issue.
-
Successful Self-Healing Rate (SSHR): The number of successful actions (NAS) over the total number of actions taken (NTA). It reflects how well the system understood the issues and how many of the actions taken actually fixed one.
-
Self-Healing Model Accuracy (SHMA): An overall measure of the Self-Healing component that indicates the accuracy of its design.
Metric | Value |
---|---|
ACD | 73920 |
AWD | 1774080 |
DC | 0.04 (4%) |
Metric | Value |
---|---|
ACD | 1331057 |
AWD | 1774080 |
DC | 0.75 (75%) |
NAF | 1 |
NTA | 92 |
FSHR | 0.010 (1%) |
NAS | 91 |
NTA | 92 |
SSHR | 0.98 (98%) |
SHMA | 0.99 (99%) |
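The ratio metrics defined above can be recomputed from the counts in the two tables. The short sketch below does so; the helper `ratio` and the variable names are illustrative. SHMA is not recomputed because its formula is not stated in this section.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Generic ratio used for DC, FSHR, and SSHR."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


AWD = 1_774_080  # awaited data, identical in both tables

dc_first = ratio(73_920, AWD)      # DC, first table: about 0.0417
dc_second = ratio(1_331_057, AWD)  # DC, second table: about 0.7503
fshr = ratio(1, 92)                # NAF / NTA: about 0.0109
sshr = ratio(91, 92)               # NAS / NTA: about 0.9891

for name, value in [
    ("DC (first table)", dc_first),
    ("DC (second table)", dc_second),
    ("FSHR", fshr),
    ("SSHR", sshr),
]:
    print(f"{name}: {value:.4f}")
```

Because NAF + NAS = NTA, FSHR and SSHR sum to 1. The recomputed SSHR of roughly 0.989 suggests that the tabulated 0.98 is truncated rather than rounded.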