Case Study: IT O&M Platform for a Listed Special Materials High-Tech Firm

Date: 2025-09-25 Editor: 运维如诗

PART 01 Project Background

01 Customer Profile

The customer in this case is a national-level high-tech enterprise specializing in the R&D, production and sales of special materials. It is a technology leader and a leading supplier in its materials field, and it is listed on the Science and Technology Innovation Board of the Shanghai Stock Exchange.

02 Pain Point Analysis

As business volume has grown rapidly, the customer has sharply increased its investment in IT. The growing variety of equipment and systems has made operation and maintenance (O&M) work increasingly complex and stressful. Factory expansion and further IT investment have intensified the O&M challenges, which include database deadlocks, frequent system freezes, and network failures. These problems have reduced production efficiency and increased customer complaints.

As equipment ages, failure rates rise, and the risks facing business support systems have increased year by year. The company has not yet established a unified O&M monitoring platform, which causes several problems: the IT O&M department cannot identify and prevent potential failures in advance; failures are hard to detect promptly when they occur; and fault analysis and handling lack effective full-stack monitoring tools, relying instead on manual layer-by-layer troubleshooting, which is inefficient. In addition, troubleshooting results are not accumulated into reusable knowledge records, so similar problems keep recurring.

Specifically, the customer’s O&M pain points are as follows:

Large-scale online resources and delayed fault detection: O&M complexity and team workload keep growing, and the component information the team must master is complicated.
Heavy workload from repetitive daily tasks: Routine inspections and support work are almost all done manually.
O&M results lack visibility: There is no visualized management, and only a few employees know the health status of the entire system; problems need to be identified earlier, preferably before the business department discovers them.
Lack of practical knowledge: Self-study alone cannot provide the most valuable hands-on experience. The customer wants access to high-quality external experts who can help solve difficulties at work and continuously improve staff knowledge and skills.

In view of these challenges, the customer urgently needs a comprehensive IT O&M solution that improves O&M efficiency and keeps its business systems and business support systems running stably and reliably.

PART 02 Lerwee Solution

To strengthen the customer’s IT systems and give O&M personnel effective support so they can work more efficiently, a monitoring platform needs to be built on top of the existing IT maintenance work. The platform should detect faults early, anticipate them so they can be handled in time, and make rational use of IT infrastructure resources to maximize utilization. It should also provide a sound basis for future IT construction and support the healthy development of the business systems.

01 Overview of Core Functions

Based on the customer’s existing IT architecture, the unified monitoring platform is deployed with the following functions:

Centralized Monitoring: Monitoring of availability, performance, logs and other indicators from IT infrastructure to business systems.
Centralized Alarm: Full-life-cycle management including centralized alarm display, alarm distribution, and alarm handling.
Visualized Views: Visual functions such as automatically discoverable network topology, screen projection views, and business topology.
Diversified Reports: Supporting custom, multi-dimensional, and multi-indicator report statistics functions.
Large Screen Display: Realizing custom display pages through large-screen centralized monitoring.
Network Configuration Management: Functions such as automatic configuration backup at custom intervals, one-click configuration delivery, and configuration backup comparison.
IP Management: Providing quick IP address location and supporting the viewing of IP status, MAC address, connected devices and port information.
Automated O&M: Providing automated O&M modules for network devices and operating systems, with functions such as script management, version management, software package management, scheduled tasks, and batch delivery.
Alarm Analysis: Supporting the alarm-associated topology function to realize fault impact range analysis and quickly open the topology interface containing the resource.
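
The fault impact range analysis mentioned under Alarm Analysis can be pictured as a graph traversal over the topology. The sketch below is only an illustration in Python and not the platform's implementation: the `DEPENDENTS` map, the resource names, and the `impact_range` function are all hypothetical.

```python
from collections import deque

# Hypothetical topology: each resource maps to the resources that depend on it.
DEPENDENTS = {
    "core-switch": ["db-server", "web-server"],
    "db-server": ["erp-app"],
    "web-server": ["erp-app"],
    "erp-app": [],
}

def impact_range(topology, failed):
    """Return every resource downstream of `failed`, in breadth-first order."""
    seen = {failed}
    queue = deque([failed])
    impacted = []
    while queue:
        node = queue.popleft()
        for dependent in topology.get(node, []):
            if dependent not in seen:  # visit each resource only once
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted

print(impact_range(DEPENDENTS, "core-switch"))
```

With this made-up topology, an alarm on `core-switch` reaches both servers and the application that depends on them, while an alarm on `erp-app` affects nothing downstream.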

02 System Architecture

The monitoring objects in this project include operating systems, network devices, databases, middleware, virtualization, servers, and storage, totaling fewer than 1,000 objects. The system architecture is deployed as follows:

Architecture Description:

Visual Area

High availability is implemented between the two web servers, with the high-availability cluster built using pcs.
The web server obtains data for display through the database VIP.
The monitoring objects are managed through the web server.

Storage Area

The primary and standby database nodes use streaming replication for hot backup, with the high-availability cluster built using pcs.

Processing Area

Two collection servers form a high-availability cluster that connects to N agent servers.
If one collection server goes down, the cluster automatically switches to the other collection server within a very short time, which helps keep data processing stable; when the failed host recovers, it automatically rejoins the cluster.
The collection server manages the collection agent servers.

Collection Area

Two collection agent servers form a high-availability cluster built using pcs.
If one collection agent server goes down, the cluster automatically switches to the other collection agent server within a very short time; when the failed host recovers, it automatically rejoins the cluster.
The servers monitored by each collection agent server do not interfere with each other.
Monitoring data on the collection agent server is first stored locally in SQLite, then stored centrally in the primary database of the primary collection server and synchronized to the standby database.
Network policy only needs to be opened between the collection agent server and the collection server.
The collection agent server supports compressed transmission and encrypted transmission.
The collection agent server is scalable, and agents can be added for different domains.
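
The store-locally-then-forward behaviour described for the collection agent can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the product's actual code: the `samples` table, the function names, and the `send` callback are hypothetical, compression uses `zlib` as a stand-in, and encryption is left out.

```python
import json
import sqlite3
import zlib

def open_buffer(path=":memory:"):
    """Open the agent's local SQLite buffer (in-memory here for illustration)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "host TEXT, metric TEXT, value REAL, ts INTEGER)"
    )
    return conn

def buffer_sample(conn, host, metric, value, ts):
    """Store one collected sample locally until it can be forwarded."""
    conn.execute(
        "INSERT INTO samples (host, metric, value, ts) VALUES (?, ?, ?, ?)",
        (host, metric, value, ts),
    )
    conn.commit()

def flush(conn, send, batch_size=100):
    """Forward the oldest samples as one compressed batch.

    `send` is a hypothetical callback that returns True on success.
    Rows are deleted only after a successful send, so nothing is lost
    while the collection server is unreachable.
    """
    rows = conn.execute(
        "SELECT id, host, metric, value, ts FROM samples ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    payload = zlib.compress(json.dumps([row[1:] for row in rows]).encode("utf-8"))
    if not send(payload):
        return 0  # keep the rows for the next attempt
    conn.execute("DELETE FROM samples WHERE id <= ?", (rows[-1][0],))
    conn.commit()
    return len(rows)
```

A caller would pass its own transport function as `send`. Keeping the rows until the send succeeds is the reason for buffering locally at all: a short outage between agent and collection server only delays delivery.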

03 Alarm Configuration

Once the monitoring objects have been brought under management, the monitoring thresholds are confirmed through communication and training with the customer; that is, alarm thresholds are configured according to the customer’s actual situation. When a monitoring indicator reaches its threshold, an alarm is triggered. Different thresholds correspond to different alarm levels, typically emergency, critical, and general.
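
As a rough illustration of how thresholds map to alarm levels, consider the Python sketch below. The numeric thresholds, the `CPU_UTIL_RULES` name, and the assumption that emergency is the most severe level are all hypothetical; the customer's real values were set case by case.

```python
# Hypothetical thresholds for a CPU utilization indicator (percent).
# Assumed severity order: emergency > critical > general.
CPU_UTIL_RULES = [
    (95.0, "emergency"),
    (85.0, "critical"),
    (75.0, "general"),
]

def alarm_level(value, rules):
    """Return the level of the highest threshold that `value` reaches, or None."""
    for threshold, level in sorted(rules, reverse=True):
        if value >= threshold:
            return level
    return None  # below every threshold: no alarm

for reading in (60.0, 80.0, 85.0, 99.0):
    print(reading, alarm_level(reading, CPU_UTIL_RULES))
```

Checking the highest threshold first ensures that a reading which crosses several thresholds raises only one alarm, at the most severe matching level.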

04 Large Screen Display Configuration

A large-screen display gives an intuitive, concise view of the actual state of all IT resources or of a particular business. After communication with the customer, the following configurations were determined:

05 Fault Self-Healing Configuration

Fault self-healing automatically triggers the corresponding handling scripts for certain common faults according to preset rules, reducing the cost of manual intervention, improving fault handling efficiency, and helping keep the system stable. After communication with the customer, the following fault self-healing scenarios were determined:
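
Independently of the scenarios agreed with the customer, the rule-to-script mechanism itself can be illustrated with a small Python sketch. Everything in it is hypothetical: the alarm keys, the `HEALING_RULES` table, and the placeholder commands, which only print a message instead of repairing anything.

```python
import subprocess
import sys

# Hypothetical preset rules: alarm key -> command to run.
# The commands are harmless placeholders, not real repair scripts.
HEALING_RULES = {
    "disk.usage.high": [sys.executable, "-c", "print('cleaning temp files')"],
    "service.down": [sys.executable, "-c", "print('restarting service')"],
}

def self_heal(alarm_key, rules, timeout=30):
    """Run the script bound to `alarm_key`; return (handled, script_output)."""
    command = rules.get(alarm_key)
    if command is None:
        return False, ""  # no matching rule: leave the alarm for manual handling
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0, result.stdout.strip()

print(self_heal("service.down", HEALING_RULES))
print(self_heal("fan.failure", HEALING_RULES))
```

Alarms without a matching rule fall through untouched, which keeps automation limited to the common faults that have been explicitly approved for it.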

PART 03 Customer Benefits

Improve O&M Efficiency: Through automated O&M tools and intelligent O&M services, reduce manual intervention and improve the efficiency and response speed of O&M work.
Reduce Operating Costs: Use intelligent analysis and prediction to optimize resource allocation and system operation, reduce unnecessary expenditures, and maximize cost-effectiveness.
Enhance Business Continuity: Ensure the smooth flow of business processes through unified monitoring and management, and reduce business interruptions caused by system failures.
Improve User Experience: Ensure service availability and response speed through real-time monitoring and rapid response mechanisms, thereby improving end-user satisfaction.
Improve System Stability: Identify and resolve potential system problems promptly through indicator collection and analysis, enhancing system stability and reliability.
Realize Fault Self-Healing: Configure automated scripts and self-healing rules so that the system repairs itself when problems are detected, shortening the impact time of faults.
Enhance the Capability of O&M Team: Through a unified O&M management platform, improve the O&M team’s ability to control complex IT environments and reduce the complexity of O&M work.
Knowledge Accumulation and Inheritance: Record and analyze the fault handling process, accumulate experience, form a knowledge base, and provide references for solving similar problems in the future.
Close Integration of Business and IT: Through the monitoring of business topology and business services, realize the close integration of business and IT O&M, and ensure the achievement of business goals.
Enhance Competitiveness: Improve the overall operational efficiency and market competitiveness of the enterprise by optimizing IT infrastructure and O&M processes.

These benefits will help the customer maintain competitiveness and achieve sustainable development in the face of a rapidly changing market and technological environment.


Lerwee