A final-year project on “Spot Instance Price Prediction using Machine Learning” was submitted by Harshal Taori (from Shri Ramdeobaba College of Engineering and Management, Nagpur, India) to extrudesign.com.
Spot pricing follows an auction-based cloud model in which the charge for a spot instance changes over time. The user pays for the time from when the instance is started. If the user ends the instance before an hourly session completes, the user is billed for the complete hour. If Amazon ends the instance, the user is not billed for the partial hour. When the current spot price rises above the bid price, the cloud provider ends the spot instance without any warning, which is a major disadvantage where availability is highly important. It is therefore crucial for the bidder to forecast spot prices before placing bids. This paper presents a technique to analyze and predict spot prices for instances using machine learning. It also discusses the implementation, the factors explored, and the outcomes on numerous instances of Amazon Elastic Compute Cloud (EC2). The technique reduces the effort and errors involved in forecasting prices.
Cloud computing offers shared resources that are reachable over the Internet. These distributed computing resources comprise both software and hardware. The service models of the cloud include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Network as a Service (NaaS). The cloud platform has different deployment models: public cloud, private cloud, hybrid cloud, and community cloud. Cloud resources are provided on a pay-as-you-go model. Amazon is a major provider of cloud computing resources.
Amazon offers three types of instances: Spot Instances, On-Demand Instances, and Reserved Instances, across around 25 regions worldwide.
Spot Instances follow dynamic pricing, which distinguishes them from the other instance types. The spot price varies dynamically with the demand for and supply of cloud services in the data centers. Clients submit bids through an online auction platform to obtain spot instances. The auction platform determines the market clearing price, or spot price; if the consumer bids above this price, a spot instance is obtained. Cloud providers publish recent spot price data to help clients in the bidding process, and provide a web-based API for bidding on spot instances. A spot instance bid request has the following parameters:
- Number of Instances
- Availability Zone
- Instance Type
- Bid Amount (price per instance per hour)
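As a rough illustration, the four bid parameters above can be grouped into a small record. The class and field names below are hypothetical and are not part of any AWS API; the check simply mirrors the rule that a bid above the current spot price obtains an instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotBidRequest:
    """Hypothetical container for the four bid parameters listed above."""
    instance_count: int            # number of instances requested
    availability_zone: str         # e.g. "us-west-2a"
    instance_type: str             # e.g. "m4.large"
    bid_per_instance_hour: float   # bid in USD per instance per hour

    def is_fulfilled(self, spot_price: float) -> bool:
        # A spot instance is obtained only when the bid exceeds the spot price.
        return self.bid_per_instance_hour > spot_price


request = SpotBidRequest(2, "us-west-2a", "m4.large", 0.05)
print(request.is_fulfilled(spot_price=0.03))  # bid is above the spot price
print(request.is_fulfilled(spot_price=0.07))  # bid is below the spot price
```

The values used here are made up for the example.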
2. Technique for pricing spot instances
Fluctuating prices and the possibility of sudden termination make Amazon EC2 spot instance pricing complicated. Termination may be triggered by either the customer or the cloud provider.
The phases of the spot price billing system are:
- The user bids for a spot instance of a particular machine type in a specified availability zone.
- A spot instance is obtained only when the bid price is greater than the current spot price.
- The spot price at launch is set as the cost of the instance.
- The pricing algorithm for spot instances works on an hourly basis despite the continuous price variations. Both the user and the cloud provider can interrupt the computation.
- The user pays for the entire hour even if the user interrupts the instance before the hour completes.
- The user does not pay for the partial hour if the interruption is performed by the cloud provider.
- After each completed hour, the user pays for the computation, and a new price is set at the start of the new hour.
- The bid price remains the same once the instance is created.
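The hourly billing rules above can be sketched as a small function. This is a minimal sketch under the rules as stated in the list; the function name, its parameters, and the prices are hypothetical.

```python
import math


def spot_charge(hourly_prices, minutes_run, terminated_by):
    """Charge for one spot instance under the hourly rules listed above.

    hourly_prices[i] is the spot price fixed at the start of hour i.
    terminated_by is "user" or "provider".
    """
    if terminated_by == "user":
        # A partial hour ended by the user is billed as a full hour.
        billable_hours = math.ceil(minutes_run / 60)
    elif terminated_by == "provider":
        # A partial hour ended by the provider is not charged.
        billable_hours = minutes_run // 60
    else:
        raise ValueError("terminated_by must be 'user' or 'provider'")
    return sum(hourly_prices[:billable_hours])


prices = [0.030, 0.032, 0.035]   # made-up prices for three consecutive hours
print(round(spot_charge(prices, 150, "user"), 3))      # three hours billed
print(round(spot_charge(prices, 150, "provider"), 3))  # two hours billed
```

For the same 150 minutes of use, the charge differs by one full hour depending on who ends the instance.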
Two important conclusions can be drawn from the above steps. Firstly, when the cloud provider terminates an instance, the user is not charged for the computation in that partial hour. Secondly, each termination by the cloud provider delays the work, because a new instance has to be started later. The boot time of the new instance is not included in the simulation’s successful execution time.
2.1 Literature Review
Spot price prediction is an active area of research. The profit for users depends directly on the accuracy of spot price predictions. Accurate price prediction is important both for saving computation time on a specific task and for significantly reducing the cost of resource usage. In recent years, prediction models have developed considerably.
The authors of one paper used the gradient descent algorithm to estimate weighted coefficients for the data from the previous month. The algorithm reformulates a linear equation as a quadratic minimization problem. The error is the difference between the predicted value and the actual spot price. The article uses a type of artificial neural network (ANN), a multilayer perceptron. The results show that an ANN is a good approach for predicting spot prices and can be helpful to users when bidding for spot instances. The authors also proposed an algorithm that uses the available computing clusters to optimize the cost and time of running simulations. It is possible to achieve the required time, referred to as the successful time, at a specified minimum reference price point. Fault tolerance is essential, because compute resources are not guaranteed for spot instances and can be stopped at any time. Computation results are usually lost on sudden termination. If no fault-tolerant techniques are applied, reliability decreases.
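To make the gradient descent idea concrete, the sketch below fits a one-variable linear model by minimizing the mean squared error, which is a quadratic function of the coefficients. It is an illustration only and not the cited authors' implementation; the function name, learning rate, and data are assumptions.

```python
def fit_linear_gd(xs, ys, lr=0.01, epochs=5000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error = predicted value minus actual value for every sample.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data generated from y = 2x + 1, so the fit should approach w=2, b=1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear_gd(xs, ys)
print(round(w, 3), round(b, 3))
```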
Checkpointing techniques, including work migration techniques, have also been discussed in the literature. These can be used to minimize monetary cost and maximize reliability. Rising demand for reserved and on-demand instances increases spot instance prices, which results in sudden revocation of spot instances. Making high bids is a possible solution, although it increases the total cost of execution.
Another study explores increasing the predictability of pricing and reducing the risks. For Amazon Web Services (AWS) EC2 instances, the study models pricing over a specified time interval using machine learning and long short-term memory (LSTM) techniques.
When using elastic spot instances, there is a trade-off between the cost and the computation time. A set of bidding mechanisms has been proposed to minimize the resource provisioning cost and unpredictability. The user can preserve a running instance by increasing the bid price so that a task completes within a time limit. Users can set a limit on the price increase, and termination of the instance can be planned once the limit is crossed. Various adaptive checkpointing schemes have been explored in terms of job completion time and monetary cost. Authors have also used a linear programming-based model to obtain an optimally randomized bidding scheme.
2.2 Methods for price forecasting
Price forecasting mechanisms fall mainly into two categories: the time-series approach and the simulation approach. The time-series approach is based on past market prices. Simulation approaches can be quite expensive because they have to process an enormous amount of data. Machine learning is one of the most widely used techniques for time-series-based forecasting of spot prices. Researchers have developed hybrid models to overcome the flaws of individual models, and studies over the past few years show that various mechanisms have been developed for predicting prices. Machine learning is a category of algorithms that lets software applications become more accurate without being explicitly programmed. Such algorithms receive input data, predict an output based on statistical analysis, and update their outputs as new data becomes available.
Machine Learning Approaches
This approach uses a range of classification techniques to assign data (for example, tweets or other text) to classes. Two types of machine learning technique are most widely used:
Supervised learning: a subcategory of machine learning that uses algorithms to train on and analyze data. Labelled data sets are prepared so that the model yields relevant outputs when it encounters similar inputs during decision making.
Unsupervised learning: a technique that uses neural networks to learn and improve on various computational tasks. Instead of relying on predefined training data sets, the models learn by themselves. There are no categories that provide the correct targets, so these techniques rely on clustering.
2.3 Algorithm we are going to use
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, “Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset.” Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote of those predictions (or, for regression, their average). A greater number of trees in the forest generally leads to higher accuracy and helps to prevent overfitting.
2.4 Why Random Forest
Whether you have a regression or classification task, random forest is an applicable model. It can handle binary features, categorical features, and numerical features. There is very little pre-processing that needs to be done. The data does not need to be rescaled or transformed.
Quick Prediction/Training Speed
Random forests are parallelizable, meaning that the work can be split across multiple machines. This results in faster computation. Boosted models, in contrast, are sequential and take longer to compute. Random forests also work well with high-dimensional data, since each tree works with subsets of the data.
Robust to Outliers and Non-linear Data
Random forest handles outliers by essentially binning them. It is also indifferent to non-linear features.
Handles Unbalanced Data
Random forest has methods for balancing error in data sets with unbalanced class populations. It tries to minimize the overall error rate, so on an unbalanced data set the larger class gets a low error rate while the smaller class gets a larger error rate.
Low Bias, Moderate Variance
Each decision tree has high variance but low bias. Because a random forest averages all of its trees, the variance is averaged down as well, which gives a model with low bias and moderate variance.
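The effect of averaging on variance can be seen in a small simulation. This is a simplified sketch: each "tree" is modelled as an unbiased but noisy estimate, and the estimates are assumed to be independent, which real trees in a forest are not. All names and numbers are illustrative.

```python
import random
import statistics

random.seed(0)


def noisy_estimate(true_value=1.0, noise=0.5):
    # Stand-in for one high-variance, low-bias tree prediction.
    return random.gauss(true_value, noise)


def averaged_estimate(n_trees):
    # Stand-in for a forest: the mean of n_trees individual predictions.
    return statistics.fmean(noisy_estimate() for _ in range(n_trees))


single = [noisy_estimate() for _ in range(2000)]
forest = [averaged_estimate(25) for _ in range(2000)]
print(statistics.pvariance(single), statistics.pvariance(forest))
```

Both sets of estimates stay centred on the true value, but the averaged ones are spread much less. Because real trees are correlated, the reduction in an actual forest is smaller than in this idealised case.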
The following steps set up the spot instances. The setup supports the machine learning training workflow and minimizes the loss of training progress if a spot interruption occurs.
Our goal is to introduce a configuration that has the following features:
- Decouple compute, storage, and code, and keep the compute instance stateless. This allows quick recovery and setup after an instance has been terminated and replaced.
- Use a dedicated volume for data sets, training progress (checkpoints), and logs. The volume should be persistent and should not be affected by instance termination.
- Use a version control system for the training code. It ensures traceability and prevents code changes from being lost when the instance is terminated.
- Minimize code changes in the training script. This ensures that the training script can be developed independently, and that backup and snapshot activities are carried out outside of the training script.
- Automate the creation of replacement instances after termination, attaching the EBS volume with the data set and checkpoints at launch, moving volumes across Availability Zones, restoring the instance state, resuming training, and terminating the instance after training completes.
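The checkpoint-and-resume part of this configuration can be sketched as a toy training loop. This is a minimal sketch, assuming the checkpoint is a small JSON file on the persistent volume; the function, the file name, and the placeholder "training step" are all hypothetical.

```python
import json
import os
import tempfile


def train(checkpoint_path, total_epochs, stop_after=None):
    """Toy training loop that resumes from a JSON checkpoint if one exists.

    stop_after simulates a spot interruption after that many epochs
    in the current run.
    """
    state = {"epoch": 0, "loss": 1.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)          # restore earlier progress

    epochs_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and epochs_this_run >= stop_after:
            return state                  # instance "terminated" mid-training
        state["epoch"] += 1
        state["loss"] *= 0.9              # placeholder for a real training step
        epochs_this_run += 1
        # Write to a temporary file and rename it, so an interruption
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state


# In practice this directory would be a mount point on the persistent volume.
checkpoint_dir = tempfile.mkdtemp()
path = os.path.join(checkpoint_dir, "checkpoint.json")
first = train(path, total_epochs=10, stop_after=4)   # interrupted run
second = train(path, total_epochs=10)                # replacement instance
print(first["epoch"], second["epoch"])
```

The second call picks up at the epoch where the first one stopped, which is the behaviour the stateless-instance design relies on.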
4. Proposed Method
The proposed method uses machine learning for spot price prediction. The input to the training model is a dataset with AWS parameter labels; the output is an analysis of spot prices across different regions and the accuracy of the predicted spot prices.
The prediction model is contained in a Jupyter notebook with code and plots for:
- Data fetching using boto3
- Data analysis
- Data insights
- Predictive data analysis
- Prediction Using Random Forest
- Feature extraction and engineering
- Data preparation
- Predictive model built
The initial stage of spot price forecasting is to get the market history provided by AWS. The first step is to create an AWS account, which allows the use of the AWS Command Line Interface (CLI) for managed services. The approach is implemented in Python. Random Forest is the machine learning algorithm used to train and test the model for better prediction accuracy.
At first glance, the data we retrieved is Amazon spot price data that was collected by a third party. The data has 5 fields: <Timestamp, Product Description, Instance Type, Spot Price, Availability Zone (AZ)>.
The data fields are:
- Timestamp (TS) is the time at which the tuple was collected.
- Product Description (PD) refers to the operating system installed on the virtual machine instance according to the customer’s demands. There are 6 different operating systems.
- Instance Type (IT) refers to the type of VM. Since the IT can be chosen to suit the customer’s business goal, AWS groups it into broad families, with 33 unique VM types.
- Spot Price (SP) is the current market price for each IT and Availability Zone (AZ).
- AZ consists of 22 unique zones in different countries across the world.
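As an illustration of working with these five fields, the sketch below parses CSV rows into typed records and derives two simple time features. The column header names, the sample rows, and the choice of features are assumptions made for the example, not details taken from the project's dataset.

```python
import csv
import io
from datetime import datetime
from typing import NamedTuple


class SpotRecord(NamedTuple):
    timestamp: datetime
    product_description: str
    instance_type: str
    spot_price: float
    availability_zone: str


def parse_records(csv_text):
    """Parse CSV text whose header names the five fields described above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        SpotRecord(
            datetime.fromisoformat(row["Timestamp"]),
            row["ProductDescription"],
            row["InstanceType"],
            float(row["SpotPrice"]),
            row["AvailabilityZone"],
        )
        for row in reader
    ]


# Two made-up rows; the header names are an assumption about the CSV layout.
sample = """Timestamp,ProductDescription,InstanceType,SpotPrice,AvailabilityZone
2017-05-01 13:00:00,Linux/UNIX,m4.large,0.0310,us-west-2a
2017-05-01 14:00:00,Linux/UNIX,m4.large,0.0325,us-west-2a
"""
records = parse_records(sample)
# Example derived features: hour of day and day of week (Monday = 0).
features = [(r.timestamp.hour, r.timestamp.weekday(), r.spot_price)
            for r in records]
print(features)
```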
4.1 Requirements for implementation
Machine learning concepts can be implemented in Python. Users should install the latest version of scikit-learn, using either the pip or conda package manager. Before implementing random forest and other machine learning algorithms, the following libraries should be installed:
- Pandas: This library is used for working with the datasets. Pandas is a software library written for the Python programming language for data analysis and manipulation.
- Matplotlib: Matplotlib is used to plot the input dataset and the output. It is the Python library for plotting graphs and charts.
- NumPy: The fundamental Python package for mathematical calculations and scientific computing.
- Random Forest: A Random Forest is an ensemble method that performs regression and classification tasks using more than one decision tree and bootstrap aggregation (bagging). With bagging, each decision tree is trained on a different data sample, where sampling is done with replacement. The basic idea is to combine multiple decision trees rather than depend on an individual decision tree to determine the final output.
- Boto3: Boto3 is the AWS SDK for Python. It lets Python developers create, configure, and manage AWS resources such as EC2 and S3. Boto3 provides an easy-to-use, object-oriented API as well as low-level access to AWS services.
4.2 Data fetching using boto3
The input dataset for the program code is a CSV file containing spot prices for 90 days. This CSV file is obtained by running an AWS Command Line Interface command on the console and converting the resulting JSON file. Fetching the spot price history from AWS involves the following steps:
- Create an account on AWS by providing all the necessary details, including credit card details.
- Once the AWS account is activated, go to the IAM (Identity and Access Management) console.
- Go to the “User” tab on the left side of the IAM console to create a new user.
- A new user needs policies attached when it is created. Access issues are common when the user is not an administrator, so attach the policy named “Administrator Access” to the user.
Upon successful creation of the user, a unique access key ID and a secret key are provided. Once these configuration settings are in place, the next step is to install the AWS Command Line Interface. Running the AWS CLI requires an installation of Python. The AWS CLI is a tool for managing AWS services: it can be used to control, configure, and monitor them from the command line, and scripts can be used for automation. It also provides simple commands for transferring files to and from Amazon S3. To fetch the spot price history into a JSON file, a command needs to be run on the CLI; a sample command is as below:
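One AWS CLI command that returns spot price history is `aws ec2 describe-spot-price-history`; the exact options used in the project are not shown here. As a hedged sketch of the JSON-to-CSV step, the code below assumes the JSON has that command's output shape, with a top-level `SpotPriceHistory` list. The function name, column order, and sample record are illustrative.

```python
import csv
import io
import json

FIELDS = ["Timestamp", "ProductDescription", "InstanceType",
          "SpotPrice", "AvailabilityZone"]


def spot_history_json_to_csv(json_text):
    """Convert spot price history JSON into CSV text with a header row."""
    history = json.loads(json_text)["SpotPriceHistory"]
    out = io.StringIO()
    # Keys other than the five fields are ignored rather than raising an error.
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for entry in history:
        writer.writerow(entry)
    return out.getvalue()


# Made-up sample in the assumed shape.
sample_json = """
{"SpotPriceHistory": [
  {"AvailabilityZone": "us-west-2a", "InstanceType": "m4.large",
   "ProductDescription": "Linux/UNIX", "SpotPrice": "0.031000",
   "Timestamp": "2017-05-01T13:00:00.000Z"}
]}
"""
csv_text = spot_history_json_to_csv(sample_json)
print(csv_text)
```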
4.3 Data analysis
Data analysis is the practice of working with data to glean useful information, which can then be used to make informed decisions.
The regression analysis uses historical data to understand how a dependent variable’s value is affected when one (linear regression) or more independent variables (multiple regression) change or stay the same. By understanding each variable’s relationship and how they developed in the past, you can anticipate possible outcomes and make better business decisions in the future.
By manipulating the data using various data analysis techniques and tools, you can begin to find trends, correlations, outliers, and variations that begin to tell a story. During this stage, you might use data mining to discover patterns within databases or data visualization software to help transform data into an easy-to-understand graphical format.
4.3.1 Rolling Statistics
Rolling is a very useful operation for time series data. Rolling means creating a window of a specified size and performing calculations on the data in this window as it rolls through the data. We computed the rolling statistics of the dataset for the prediction of spot prices.
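A rolling mean and standard deviation can be sketched in plain Python as follows. This is a minimal illustration with made-up prices and an arbitrary window size; the project's own window size is not stated.

```python
import statistics
from collections import deque


def rolling_stats(values, window):
    """Yield (mean, sample std) for every full window rolling over values."""
    buf = deque(maxlen=window)   # the oldest value drops out automatically
    for v in values:
        buf.append(v)
        if len(buf) == window:
            yield statistics.fmean(buf), statistics.stdev(buf)


prices = [0.030, 0.031, 0.029, 0.035, 0.040, 0.038]  # made-up hourly prices
for mean, std in rolling_stats(prices, window=3):
    print(round(mean, 4), round(std, 4))
```

With pandas, the comparable calls would be `Series.rolling(window).mean()` and `Series.rolling(window).std()`.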
4.3.2 Test of stationarity
Time series is different from more traditional classification and regression predictive modeling problems. The temporal structure adds order to the observations. This imposed order means that important assumptions about the consistency of those observations need to be handled specifically.
We used the Dickey-Fuller test to check whether the dataset is stationary. In statistics, the Dickey-Fuller test tests the null hypothesis that a unit root is present in an autoregressive time series model. We found that the dataset was not stationary.
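The idea behind the test can be shown with a simplified, non-augmented Dickey-Fuller regression written in plain Python. This is a sketch on synthetic data, not the project's code; the critical value is the approximate large-sample 5% value for the version with a constant and no trend. In practice a library routine such as `adfuller` from statsmodels would normally be used.

```python
import math
import random


def dickey_fuller_stat(series):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    x = series[:-1]                                   # lagged levels y[t-1]
    d = [b - a for a, b in zip(series, series[1:])]   # first differences
    n = len(x)
    mean_x, mean_d = sum(x) / n, sum(d) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxd = sum((xi - mean_x) * (di - mean_d) for xi, di in zip(x, d))
    gamma = sxd / sxx
    alpha = mean_d - gamma * mean_x
    rss = sum((di - alpha - gamma * xi) ** 2 for xi, di in zip(x, d))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_err


random.seed(1)
noise = [random.gauss(0, 1) for _ in range(500)]      # stationary series
walk = []
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)                                # random walk: unit root

CRITICAL_5PCT = -2.86   # approximate large-sample value, constant, no trend
print(dickey_fuller_stat(noise) < CRITICAL_5PCT)      # unit root rejected?
print(dickey_fuller_stat(walk), dickey_fuller_stat(noise))
```

A statistic below the critical value rejects the unit-root hypothesis, which suggests a stationary series. A statistic above it, as is typical for a random walk, does not.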
Above is the graph of the hourly spot prices for spot instance m4.large from the us-west-2 region.
4.4 Prediction using random forest
Random Forests is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them. The method uses an algorithm for inducing regression trees that combines random subspaces with bagging. This is achieved in the tree-growing process through random selection of the input variables. When splitting a node during the construction of a tree, the split that is chosen is no longer the best split among all features. Instead, the split used at each node is the best split among a random subset of the features. Specifically, when growing a tree on a bootstrapped dataset, a random subset of the input variables is selected as split candidates before each split.
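The combination of bagging and random feature selection can be sketched with a toy forest of one-split regression trees (stumps). This is a deliberately simplified stand-in written for illustration, not the project's model, which would more plausibly use a library implementation such as scikit-learn's `RandomForestRegressor`. All function names and the synthetic data are assumptions.

```python
import random
import statistics


def fit_stump(rows, targets, n_candidate_features, rng):
    """One-split regression tree using a random subset of the features."""
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), n_candidate_features)
    best = None
    for f in candidates:
        for threshold in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < threshold]
            right = [t for r, t in zip(rows, targets) if r[f] >= threshold]
            lm, rm = statistics.fmean(left), statistics.fmean(right)
            sse = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # every candidate feature was constant in this sample
        mean = statistics.fmean(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]


def fit_forest(rows, targets, n_trees=50, n_candidate_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx],
                                n_candidate_features, rng))
    return forest


def predict(forest, row):
    # Average the individual tree predictions.
    return statistics.fmean(lm if row[f] < thr else rm
                            for f, thr, lm, rm in forest)


# Synthetic data: features are (hour of day, zone index); the made-up price
# depends only on the hour.
rows = [(hour, zone) for hour in range(24) for zone in range(3)]
targets = [0.03 if hour < 12 else 0.05 for hour, _ in rows]
forest = fit_forest(rows, targets)
print(predict(forest, (3, 0)), predict(forest, (20, 0)))
```

Because each stump sees only one randomly chosen feature, some trees split on the uninformative zone index. Averaging over the forest still separates the morning and evening prices.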
It is possible to help customers by offering the correct price range according to their demands, the time of day, and the machine type. This information will help customers to bid the best possible price and decrease their cost of cloud services usage. In future work, rules could be derived at the end of this process by using rule induction.
A method for predicting spot prices is proposed using the spot prices provided by Amazon during the past 90 days. Although there have been many approaches used in recent years, Random Forest Regressor is used in the proposed method to assess predictability. The approach provides an implementation for instance analyses regarding console, time, and region.
All the prices predicted with respect to their time zones are authentic and dynamically calculated.
Spot prices are fetched using an AWS CLI script that is integrated with our machine learning algorithm. On the graphs, the curves for a few regions rise initially and then level off at a roughly constant average, whereas for others they keep increasing as the days go on.
Results can be validated against the calculations in the AWS instance report found on the AWS documentation page. The same approach can be applied to more regions and sizes across different continents and to multiple volumes. As AWS keeps introducing new schemes, spot prices will differ in future trials. This technique helps to fill in an incomplete understanding of the relationship between time, region, and pricing.
- Amazon EC2 instance types. Retrieved from <http://aws.amazon.com/ec2/instance-types/> (Accessed in March 2015)
- What is Amazon EC2? Retrieved from <http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/> (Accessed in March 2015)
- AWS Blog Now available-New C4 Instances. Retrieved from <https://aws.amazon.com/blogs/aws/now-available-new-c4-instances/> (Accessed in March 2015)
- V. K. Singh and K. Dutta, “Dynamic Price Prediction for Amazon Spot Instances,” 2015 48th Hawaii International Conference on System Sciences, Jan. 2015.
- O. Agmon Ben-Yehuda, M. Ben-Yehuda, A. Schuster, and D. Tsafrir, “Deconstructing Amazon EC2 Spot Instance Pricing,” ACM Transactions on Economics and Computation, vol. 1, no. 3, pp. 1–20, Sep. 2013.
- W. Guo, K. Chen, Y. Wu, and W. Zheng, “Bidding for Highly Available Services with Low Price in Spot Instance Market,” Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’15), 2015.
- L. Khaidem, S. Saha, and S. R. Dey, “Predicting the direction of stock market prices using Random Forest,” arXiv preprint arXiv:1605.00003 (2016). To appear in Applied Mathematical Finance.
- AWS spot instances dataset. URL: https://www.kaggle.com/noqcks/aws-spot-pricing-market
- AWS spot instance user guide, Amazon.com. URL: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html
Credit: A final-year project on “Spot Instance Price Prediction using Machine Learning” was completed by Shubham Dhanuka, Deepanshu Sahu, and Harshal Taori under the guidance of Prof. Purshottam J. Assudani, Department of Information Technology, Shri Ramdeobaba College of Engineering and Management, Nagpur, India.