DerbySoft enables travel companies to work together through technology and innovation. It provides high-performance distribution services that let suppliers, distributors, metasearch engines and a variety of other travel technology providers conduct digital commerce efficiently and effectively. The company’s Innovation Lab, led by Xia Wei, has helped optimize the servers already in use, identify potential obstacles and reduce variances across current DerbySoft connections.
In recent years, both the cost of and the demand for global cloud computing services have increased substantially. DerbySoft runs thousands of systems on Amazon Web Services, where a large number of applications for different business units are deployed as machine clusters or individual machines. Such a large fleet increases manual maintenance, and the number of services DerbySoft runs on these machines continues to grow. As a result, the utilization rates of the machines are not ideal.
It has become imperative that DerbySoft optimize the utilization of its machines and increase the productivity of each one. Connectivity in general has also become more complex, with various service components added to the pipeline at the same time. If several components malfunction, one or more connections could be severely disrupted, and with the current alarms and manual inspections it is challenging to cover every component on the servers.
DerbySoft has developed and implemented Capacity Forecast and Anomaly Detection into the existing services. These two new functionalities were developed based upon the company’s expertise with artificial intelligence and machine learning and can learn from past data to better predict and find the problems without manually setting rules.
Capacity Forecasting provides visualization and alarm services that help engineers see the capacity on a server and set an alarm to notify them if capacity is running short or is underutilized. When an alert is signaled, the engineer can adjust capacity based on the suggestions provided and actual business needs. This not only saves manpower in operation and maintenance but also increases the utilization of the machines without sacrificing the quality and stability of DerbySoft’s services. See below:
DerbySoft tested the relationship between TPS and CPU utilization on a DStorage server cluster, as seen in the Quantile Regression model above. The red line represents the trend of CPU utilization increasing with TPS. The yellow line indicates that, in 99.99% of cases, CPU utilization at a given TPS will not exceed the yellow line. This gives engineers a measured suggestion so they can adjust machine capacity according to the Capacity Forecast. The adjustment meets the needs of the business while optimizing the utilization of the machines and ensuring reliability.
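The quantile regression idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DerbySoft's implementation: the TPS and CPU data are synthetic, the function names are hypothetical, the upper line uses the 0.95 quantile rather than 99.99% because the sample is tiny, and the fit is a brute-force search over lines through pairs of points (an optimal quantile regression line can always be found among those). A production system would more likely use a library solver.

```python
import random
from itertools import combinations


def pinball_loss(residual, q):
    """Quantile (pinball) loss for one residual = actual - predicted."""
    return q * residual if residual >= 0 else (q - 1) * residual


def fit_quantile_line(xs, ys, q):
    """Return (intercept, slope) minimizing total pinball loss at quantile q.

    Brute force: try every line through two data points and keep the best.
    Only practical for small samples.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid fit
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(pinball_loss(y - (intercept + slope * x), q)
                   for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


def predict(line, x):
    """Evaluate a fitted (intercept, slope) line at x."""
    return line[0] + line[1] * x


# Synthetic data (assumption): CPU % grows roughly linearly with TPS plus noise.
rng = random.Random(0)
tps = [rng.uniform(50, 500) for _ in range(100)]
cpu = [5 + 0.1 * x + rng.uniform(0, 10) for x in tps]

median_line = fit_quantile_line(tps, cpu, 0.5)   # plays the role of the trend line
upper_line = fit_quantile_line(tps, cpu, 0.95)   # plays the role of the upper bound

# Share of observations on or below the upper line (small tolerance for rounding).
share_below = sum(c <= predict(upper_line, t) + 1e-9
                  for t, c in zip(tps, cpu)) / len(tps)
```

Reading the upper line at an expected TPS gives a CPU level that is rarely exceeded, which is the kind of figure an engineer could size a cluster against.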
Capacity Forecast focuses on the usage of CPU, memory, disk, network sockets, network bandwidth and similar resources. Alongside the adoption of more advanced computing and storage technologies, capacity forecasting for cloud services has become increasingly important.
Time Series Analysis
The usage of an individual service is hard to predict because it varies over time, but overall usage is somewhat easier to predict. This is why the system is designed to predict the overall workload of cloud computing, taking into account time series characteristics such as periodic behavior, holidays, trends and special events.
Traditional statistical forecasting methods need to adjust to the growth pattern in each case. For example, some time series exhibit linear growth, other time series exhibit exponential growth, and still other time series may have seasonal components. This challenge can be met by comprehensive forecasting methods, allowing DerbySoft to build a robust model in any time series with minimal manual parameter adjustments.
There are several statistical forecasting methods:
- Methods Based on Linear Regression: These methods are first fitted to past observations and then used to predict the upcoming workload. They include classical time series models such as the Autoregressive Integrated Moving Average (ARIMA) family and Holt-Winters Exponential Smoothing.
- Deep Learning Methods: These methods include neural networks and DeepAR.
- Other Non-Linear Regression Methods
After studying the above methods, DerbySoft weighed accuracy against training efficiency and chose Facebook’s open-source Prophet algorithm. Prophet predicts time series data with an additive model in which non-linear trends are combined with yearly, weekly and other seasonal patterns as well as holiday effects. It works best on data with strong periodic behavior and at least one year of history. Prophet is also robust to missing values, shifts in trend and large numbers of outliers.
A common method in time series analysis is decomposition, which divides a time series into several parts: a seasonal item, a trend item and a remaining item. In practice there is usually a holiday effect as well, so the Prophet algorithm includes all four items at the same time:
y(t) = g(t) + s(t) + h(t) + e(t)
Here, g(t) is the trend item, which represents the non-periodic change in the time series. The s(t) in the equation is the periodic or seasonal item, generally in units of weeks or years. The h(t) is the holiday item, which captures the effect of a holiday falling on a certain day. The e(t) is the error item, or remaining item: whatever the other three items do not capture, including outliers. The Prophet algorithm obtains the predicted value of the time series by fitting each item and adding them up. The following Facebook event data shows a series that follows these patterns and is therefore particularly well suited to the Prophet algorithm.
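To make the additive structure concrete, here is a small, self-contained sketch that splits a synthetic daily series into trend, weekly seasonal, holiday and remainder items and then adds them back up. It is a simplified stand-in and not Prophet's fitting procedure (Prophet fits a piecewise trend and Fourier-series seasonality jointly); the function names, the straight-line trend, the data and the holiday dates are all assumptions made for illustration.

```python
import math


def fit_line(ts, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
             / sum((t - mean_t) ** 2 for t in ts))
    return mean_y - slope * mean_t, slope


def decompose(ys, holidays, period=7):
    """Split ys into trend g, seasonal s, holiday h and remainder e."""
    ts = list(range(len(ys)))
    intercept, slope = fit_line(ts, ys)
    g = [intercept + slope * t for t in ts]
    detrended = [y - gt for y, gt in zip(ys, g)]

    # Seasonal item: mean detrended value per position in the cycle,
    # estimated from non-holiday days only.
    seasonal_means = []
    for k in range(period):
        vals = [d for t, d in zip(ts, detrended)
                if t % period == k and t not in holidays]
        seasonal_means.append(sum(vals) / len(vals))
    s = [seasonal_means[t % period] for t in ts]

    # Holiday item: mean of what is left over on holiday days.
    leftover = [d - st for d, st in zip(detrended, s)]
    holiday_vals = [leftover[t] for t in ts if t in holidays]
    effect = sum(holiday_vals) / len(holiday_vals) if holiday_vals else 0.0
    h = [effect if t in holidays else 0.0 for t in ts]

    # Remainder item: whatever the other three items do not explain.
    e = [y - gt - st - ht for y, gt, st, ht in zip(ys, g, s, h)]
    return g, s, h, e


# Synthetic series (assumption): upward trend, weekly cycle, dips on two holidays.
holidays = {10, 31}
y = [100 + 0.5 * t + 10 * math.sin(2 * math.pi * t / 7)
     + (-30 if t in holidays else 0)
     for t in range(56)]

g, s, h, e = decompose(y, holidays)

# One-step-ahead forecast for a non-holiday day: extend the trend, reuse the cycle.
next_t = len(y)
forecast_next = g[-1] + (g[1] - g[0]) + s[next_t % 7]
```

By construction the four lists sum back to the original series at every time step, which mirrors the equation above.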
DerbySoft implemented Capacity Forecast services in the production environment and developed Capacity Forecast guidance to help the company’s business team reduce server capacity. The figure below is a Capacity Forecast for a cluster.
The green dots in the figure above represent the actual API request volume in the Shop service, the solid blue line represents the predicted value, and the shaded area is the predicted confidence interval. Most of the actual access requests in the past have fallen within the predicted confidence interval. The Capacity Forecast provides a forecast value for the next 24 hours, so engineers can watch the chart to decide whether to increase or decrease the number of machines.
The Capacity Forecasting service also provides further recommendations on the number of cluster machines, as shown below:
The recommended number of machines includes a 40% margin, and with the forecast service behind them engineers have more confidence to reduce the number of servers. This reduction increases the utilization rate of the machines while ensuring reliability.
Currently, Capacity Forecasting covers 114 machines at DerbySoft. One cluster is projected to shrink from 12 machines to six, another from nine machines to six, and a third is expected to reduce its machine count by 16%. These reductions indicate that Capacity Forecast works, and it will be rolled out to more clusters and company offerings in the future.
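The source does not give the formula behind the recommendation, so the following is only a sketch of how a margin might be applied. It assumes that "40% margin" means 40% headroom on top of the forecast peak and that each machine has a known sustainable throughput; the function name, parameters and numbers are hypothetical.

```python
import math


def recommended_machines(forecast_peak_tps, tps_per_machine, margin=0.40, minimum=1):
    """Machines needed to serve the forecast peak plus a safety margin.

    Assumption: the margin is headroom added on top of the forecast peak.
    """
    if tps_per_machine <= 0:
        raise ValueError("tps_per_machine must be positive")
    needed = forecast_peak_tps * (1 + margin) / tps_per_machine
    # Round up, and never recommend fewer than the configured minimum.
    return max(minimum, math.ceil(needed))


# Hypothetical cluster: forecast peak of 2000 TPS, 500 TPS per machine.
machines = recommended_machines(2000, 500)
```

Rounding up and enforcing a minimum keep the recommendation on the safe side; a team could also read the peak from the upper bound of the confidence interval rather than the point forecast.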
Anomaly Detection is the process of identifying abnormal events or behaviors from normal time series and is one of the most mature applications of time series data analysis. Effective Anomaly Detection is widely used in many fields in the real world, such as quantitative trading, network security detection, self-driving cars and daily maintenance of large industrial equipment.
Consider a spacecraft in orbit. It is an expensive piece of machinery with a complex system, and any lapse in detecting a malfunction could cause serious and irreparable damage. Anomalies could become serious at any moment, so accurate and timely Anomaly Detection can prompt aerospace engineers to take appropriate measures well in advance.
DerbySoft is not supporting a spacecraft, but it does handle thousands of extremely complicated connections. An error in one connection could cause a service failure further downstream with another OTA, which is why DerbySoft has launched Anomaly Detection services to improve the quality and stability of the company’s services.
Before launching Anomaly Detection, DerbySoft created a set of alerts and alarms based on rules and fixed thresholds. These alerts and alarms were particularly important and helped engineers discover unknown obstacles in the connectivity services. However, there were thousands of connectivity components, and it was hard to manually track and set rules for each one, so some abnormalities could be overlooked. At the same time, DerbySoft has thousands of clients that may change some of their services without notifying a DerbySoft engineer. Anomaly Detection can learn from the historical data of a connection without requiring a lot of manpower to set rules, which greatly reduces the workload of engineers and increases the coverage of alerts and alarms.
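The difference between a fixed threshold and a threshold learned from history can be illustrated with a short sketch. The source does not describe the model DerbySoft uses, so this example swaps in a simple rolling z-score as a stand-in: each new value is compared with the mean and standard deviation of the preceding window, so no per-connection rule has to be written by hand. The function name, window size, threshold and data are assumptions.

```python
import math
from statistics import mean, stdev


def detect_anomalies(values, window=24, threshold=3.0, min_sigma=1e-9):
    """Return indices whose value deviates strongly from recent history.

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` points.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)  # guard against flat history
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Synthetic metric (assumption): a smooth 12-step cycle with one injected spike.
series = [100 + 5 * math.sin(2 * math.pi * i / 12) for i in range(72)]
series[60] = 160

anomalies = detect_anomalies(series)
```

Because the baseline is recomputed from each connection's own recent data, the same code adapts to connections with very different normal levels, which is the property that removes the need for hand-set rules.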
The following figure is the architecture diagram of the first version of Anomaly Detection:
Anomaly Detection covers the full pipeline of the company’s Streamlined Connectivity services. Prometheus stores the values used for data visualization, and AWS S3 stores the persistent data, including historical time series data and model output data. Finally, MySQL holds all of the indicator data.
Anomaly Detection is currently in use and is integrated into the DerbySoft alarm systems for Beaconfire and Slack. Teams using Anomaly Detection can set different alarm levels for different services or customize rules in the alarm system to suit their needs. There have been a few instances where smaller system anomalies were missed by the existing alarm system but were detected by Anomaly Detection.
DerbySoft is deeply rooted in the spirit of innovation, and the company will always be driven by technology. The application of artificial intelligence, initiated by the Innovation Lab, has achieved great preliminary results. In the future, the team will apply machine learning to more business scenarios to provide customers with better, more efficient services and to make the travel business easier.