This project aims to build a predictive model that uses historical online retail transaction data to forecast future sales revenue.
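The original listing does not show the package setup. The functions used throughout this report imply the following packages, so a minimal setup sketch would be:
# Packages assumed throughout this report (setup not shown in the original listing)
library(tidyverse)  # read_csv(), dplyr verbs, tidyr::complete()/fill(), ggplot2
library(zoo)        # na.approx() for linear interpolation
library(tseries)    # adf.test()
library(forecast)   # auto.arima(), arimaorder(), ets(), checkresiduals(), forecast(), accuracy()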
data <- read_csv("online_retail_II.csv")
dim(data)
[1] 1067371 8
The data source for this project is Kaggle, an online platform that hosts a wide variety of datasets for data analysis, machine learning, and AI projects. From the available datasets, we selected the “Online Retail II” dataset, originally compiled by Dr. Daqing Chen of London South Bank University. The dataset contains 1,067,371 records and 8 attributes describing online retail transactions made between 01/12/2009 and 09/12/2011.
Using str(), we get a basic overview of the dataset.
str(data)
spc_tbl_ [1,067,371 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Invoice : chr [1:1067371] "489434" "489434" "489434" "489434" ...
$ StockCode : chr [1:1067371] "85048" "79323P" "79323W" "22041" ...
$ Description: chr [1:1067371] "15CM CHRISTMAS GLASS BALL 20 LIGHTS" "PINK CHERRY LIGHTS" "WHITE CHERRY LIGHTS" "RECORD FRAME 7\" SINGLE SIZE" ...
$ Quantity : num [1:1067371] 12 12 12 48 24 24 24 10 12 12 ...
$ InvoiceDate: POSIXct[1:1067371], format: "2009-12-01 07:45:00" ...
$ Price : num [1:1067371] 6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
$ Customer ID: num [1:1067371] 13085 13085 13085 13085 13085 ...
$ Country : chr [1:1067371] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
- attr(*, "spec")=
.. cols(
.. Invoice = col_character(),
.. StockCode = col_character(),
.. Description = col_character(),
.. Quantity = col_double(),
.. InvoiceDate = col_datetime(format = ""),
.. Price = col_double(),
.. `Customer ID` = col_double(),
.. Country = col_character()
.. )
- attr(*, "problems")=<externalptr>
Two of the most noteworthy variables are “Price” and “Quantity”, which provide quantitative information about each transaction. “Customer ID”, although stored as a number, is a nominal variable: it identifies the customer rather than carrying any numerical meaning.
Next, we look at the first six rows of the dataset to get a better feel for the data.
head(data)
# A tibble: 6 × 8
Invoice StockCode Descri…¹ Quant…² InvoiceDate Price Custo…³
<chr> <chr> <chr> <dbl> <dttm> <dbl> <dbl>
1 489434 85048 "15CM C… 12 2009-12-01 07:45:00 6.95 13085
2 489434 79323P "PINK C… 12 2009-12-01 07:45:00 6.75 13085
3 489434 79323W "WHITE … 12 2009-12-01 07:45:00 6.75 13085
4 489434 22041 "RECORD… 48 2009-12-01 07:45:00 2.1 13085
5 489434 21232 "STRAWB… 24 2009-12-01 07:45:00 1.25 13085
6 489434 22064 "PINK D… 24 2009-12-01 07:45:00 1.65 13085
# … with 1 more variable: Country <chr>, and abbreviated variable
# names ¹Description, ²Quantity, ³`Customer ID`
The main business challenge addressed in this project is predicting future revenue for online retail transactions. The goal is to build a predictive model that uses information about customers and their past purchases to forecast future sales revenue.
To accomplish this goal, the project aims to answer the following questions:
Which products, customers, and countries generate the highest revenue? By aggregating the sales data and calculating total revenue for each product, customer, and country, we can identify the most profitable ones to focus on.
What is the trend in sales for specific products, customers, and countries?
Can we use time series analysis to predict future revenue for specific products, customers, and countries? By applying time series models such as ARIMA to historical sales data, we can make predictions about future sales.
Overall, this project gives us insight into sales patterns and supports informed decisions about future sales. By understanding sales trends and producing forecasts, a retail company can better plan and allocate resources, such as inventory and marketing effort, to maximize revenue.
First, we need to clean the data so that it is accurate and meaningful for analysis. We start by converting the “Customer ID” column to a factor variable and converting the “InvoiceDate” column to a date using the as.Date() function.
We need to filter out any rows where the “Price” and “Quantity” columns are less than or equal to zero, as these represent invalid or erroneous data.
We can calculate the total price for each transaction by multiplying the “Quantity” and “Price” columns together and creating a new column called “TotalPrice”. This will give us a new variable that we can use in our analysis.
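The cleaning code itself is not shown in the original listing; a minimal sketch of the steps just described, assuming the column names above, would be:
# Sketch of the cleaning steps described above
data <- data %>%
  mutate(`Customer ID` = as.factor(`Customer ID`),   # treat the ID as a nominal label, not a number
         InvoiceDate = as.Date(InvoiceDate)) %>%     # keep only the calendar date
  filter(Quantity > 0, Price > 0) %>%                # drop rows with non-positive price or quantity
  mutate(TotalPrice = Quantity * Price)              # revenue for each line item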
Next, here are some key points and interesting observations based on the summary:
summary(data)
Invoice StockCode Description
Length:1041671 Length:1041671 Length:1041671
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Quantity InvoiceDate Price
Min. : 1.00 Min. :2009-12-01 Min. : 0.001
1st Qu.: 1.00 1st Qu.:2010-07-12 1st Qu.: 1.250
Median : 3.00 Median :2010-12-07 Median : 2.100
Mean : 10.96 Mean :2011-01-03 Mean : 4.077
3rd Qu.: 10.00 3rd Qu.:2011-07-24 3rd Qu.: 4.130
Max. :80995.00 Max. :2011-12-09 Max. :25111.090
Customer ID Country TotalPrice
17841 : 12890 Length:1041671 Min. : 0.00
14911 : 11245 Class :character 1st Qu.: 3.90
12748 : 7228 Mode :character Median : 9.96
14606 : 6566 Mean : 20.13
14096 : 5111 3rd Qu.: 17.70
(Other):762509 Max. :168469.60
NA's :236122
• There are 1,041,671 rows in the dataset.
• “Quantity” shows the number of units of products sold across all transactions in the dataset. The minimum quantity is 1, and the maximum quantity is 80,995. On average, there were 10.96 units sold per transaction.
• “InvoiceDate” shows the date range of transactions in the dataset. The first transaction was made on December 1, 2009, and the last transaction was made on December 9, 2011.
• “Price” shows the range of prices for products in the dataset. The minimum price is 0.001, and the maximum price is 25,111.09. On average, the price per unit sold is 4.077.
• “TotalPrice” shows the total revenue generated across all transactions in the dataset. The minimum revenue is 0.00, and the maximum revenue is 168,469.60. On average, the revenue generated per transaction is 20.13.
Exploratory data analysis is an important step in the data analysis process. It involves examining and summarizing the main characteristics of the data to gain a better understanding of its properties and discover patterns and relationships within it. One way to conduct EDA is to group the data by various variables and create visualizations to help us better understand the relationships between those variables.
In this dataset, we can group the data by different variables such as product, customer, and country. By doing so, we can identify any interesting patterns or insights that may be useful for further analysis and decision-making.
We’ll start by identifying the top 20 products that have generated the highest total revenue. To achieve this, we group the data by product description, and calculate the sum of the total price for each product. Next, we sort the products in descending order based on their total revenue and pick the top 20 products.
dd <- data %>% group_by(Description) %>% summarise(TPrice = sum(TotalPrice)) %>% arrange(desc(TPrice)) %>% head(20)
dd$Description <- factor(dd$Description, levels = dd$Description[order(dd$TPrice, decreasing = TRUE)])
options(scipen = 999)
dd %>% ggplot(aes(x = Description, y = TPrice)) +
geom_segment(aes(xend = Description, yend = 0)) +
geom_point(size = 4, color = "blue") +
geom_text(aes(label = round(TPrice, 2)), hjust = -0.3) +
ylim(c(0, 500000)) +
labs(title = "Top 20 Best-Selling Products") +
ylab("Total revenue") + coord_flip() +
theme(plot.title = element_text(hjust = 0.5))
Exploring the top customers by total purchase amount is an essential part of any exploratory data analysis. By grouping the data by customer and summing the total purchase amount of each transaction, we can identify the customers who have spent the most money.
dcd <- data %>% group_by(`Customer ID`) %>% na.omit() %>% summarise(TPrice = sum(TotalPrice)) %>% arrange(desc(TPrice)) %>% head(20)
dcd$`Customer ID` <- factor(dcd$`Customer ID`,levels = dcd$`Customer ID`[order(dcd$TPrice, decreasing = TRUE)])
options(scipen = 999)
dcd %>% ggplot(aes(x = `Customer ID`, y = TPrice)) +
geom_segment(aes(xend = `Customer ID`, yend = 0)) +
geom_point(size = 4, color = "blue") +
geom_text(aes(label = round(TPrice, 2)), hjust = -0.3, angle = 90) +
ylim(c(0, 800000)) +
labs(title = "Top 20 Customers by Total Purchase Amount") +
ylab("Total purchase amount") +
theme(plot.title = element_text(hjust = 0.5))
By grouping the data by country and summing the total purchase amount for each transaction, we can identify the countries that have generated the highest total revenue.
dc <- data %>% group_by(Country) %>% summarise(TPrice = sum(TotalPrice)) %>% arrange(desc(TPrice)) %>% head(20)
dc$Country <- factor(dc$Country,levels = dc$Country[order(dc$TPrice, decreasing = TRUE)])
options(scipen = 999)
dc %>% ggplot(aes(x = Country, y = TPrice)) +
geom_segment(aes(xend = Country, yend = 0)) +
geom_point(size=4, color="blue") +
geom_text(aes(label = round(TPrice, 2)), hjust = -0.3) +
ylim(c(0, 25000000)) +
labs(title = "Top 20 Countries by Total Revenue") +
ylab("Total revenue") + coord_flip() +
theme(plot.title = element_text(hjust = 0.5))
In this stage of the analysis, new datasets will be created that focus on the top products, customers, and countries with the highest revenue. By filtering the data in this way, these datasets can be used to create visualizations and gain insights into the top revenue-generating areas of the business. From here, we can dive deeper into specific products, customers, and countries to identify patterns and trends that may be influencing revenue.
We focused on a specific product, the “REGENCY CAKESTAND 3 TIER”, which generated the highest revenue. We created a new dataset to ensure that the data related to this product is complete and ready for analysis. We filled in any missing data by using linear interpolation, which estimates missing values based on the pattern of the existing data, allowing us to have a complete dataset.
We then create a plot showing the “TotalPrice” of the product over time using the invoice dates.
ddf <- data %>% filter(Description == "REGENCY CAKESTAND 3 TIER") %>% group_by(InvoiceDate) %>% summarise(Dtprice = sum(TotalPrice))
x <- as.numeric(max(ddf$InvoiceDate) - min(ddf$InvoiceDate))
ddf <- ddf %>%
complete(InvoiceDate = seq.Date(min(InvoiceDate), max(InvoiceDate), by = "day")) %>%
mutate(Id = c(1:(x + 1))) %>%
fill(Id) %>%
mutate(Dtprice = na.approx(Dtprice, na.rm = FALSE))
ddf %>% ggplot(aes(x = InvoiceDate, y = Dtprice)) + geom_line() + labs(x = "Invoice Date", y = "Total Price")
To analyze the spending behavior of our top customer, we created a new dataset focusing on the customer with the highest revenue. By imputing missing values with linear interpolation, we ensured that the dataset is complete and ready for analysis.
A plot is then created to show the “TotalPrice” of the customer over time as indicated by the invoice dates.
dcdf <- data %>% filter(`Customer ID` == "18102") %>% group_by(InvoiceDate) %>% summarise(DCprice = sum(TotalPrice))
x <- as.numeric(max(dcdf$InvoiceDate) - min(dcdf$InvoiceDate))
dcdf <- dcdf %>%
complete(InvoiceDate = seq.Date(min(InvoiceDate), max(InvoiceDate), by = "day")) %>%
mutate(Id = c(1:(x+1))) %>%
fill(Id) %>%
mutate(DCprice = na.approx(DCprice, na.rm=FALSE))
dcdf %>% ggplot(aes(x = InvoiceDate, y = DCprice)) + geom_line() + labs(x = "Invoice Date", y = "Total Price")
We have created a new dataset that focuses on the country that has generated the highest revenue, which is the United Kingdom. To make sure that this dataset is complete and can be used for analysis, we filled in any missing values of date and corresponding total price using linear interpolation.
After that, we created a plot that shows the total spending of the United Kingdom over time based on the invoice dates.
dcf <- data %>% filter(Country == "United Kingdom") %>% group_by(InvoiceDate) %>% summarise(Dprice = sum(TotalPrice))
x <- as.numeric(max(dcf$InvoiceDate) - min(dcf$InvoiceDate))
dcf <- dcf %>%
complete(InvoiceDate = seq.Date(min(InvoiceDate), max(InvoiceDate), by = "day")) %>%
mutate(Id = c(1:(x+1))) %>%
fill(Id) %>%
mutate(Dprice = na.approx(Dprice, na.rm = FALSE))
dcf %>% ggplot(aes(x = InvoiceDate, y = Dprice)) + geom_line() + labs(x = "Invoice Date", y = "Total Price")
In this section, we perform time series analysis on our top product, customer, and country using several statistical techniques: a stationarity test, ACF and PACF analysis, and ARIMA and ETS models. By combining these techniques, we can identify underlying patterns and trends in the sales data and produce forecasts of future sales.
If a time series exhibits no trend, constant variance over time, and constant autocorrelation over time, it is considered to be “stationary” (Witt, 1988).
The association between a time series and a delayed version of itself is known as autocorrelation (Demir, September 2020). The ACF measures the correlation between a time series and its lags, while the PACF measures the correlation between a time series and its lags after removing the effects of the intermediate lags. The ACF and PACF plots can help identify the order of the ARIMA model, specifically the number of autoregressive (AR) and moving average (MA) terms.
The Auto-Regressive Integrated Moving Average, or ARIMA, model effectively develops a linear equation that explains and predicts the time series data (Chen, October 2008).
It is a combination of autoregression (AR), integration (I), and moving average (MA) models. The ARIMA model assumes that the future values of a time series can be predicted based on its past values and the errors of the previous predictions. The model is defined by three parameters: p (order of the AR model), d (degree of differencing), and q (order of the MA model).
The ETS (Error, Trend, Seasonality) model predicts future values using weighted averages of earlier observations (Jain, 2017). Exponential smoothing gives more weight to recent data and less to older data, which helps capture recent changes that may affect future values. The ETS model combines the effects of errors, trend, and seasonality to forecast future values. Three parameters determine how these components are combined: error type (additive or multiplicative), trend type (none, additive, or multiplicative), and seasonality type (none, additive, or multiplicative). These are chosen based on the characteristics of the data and the business problem being solved.
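As a purely illustrative sketch on a simulated series (not the retail data), the (p, d, q) order of an ARIMA fit and the error/trend/seasonality specification of an ETS fit are passed as follows:
# Illustrative only: specifying ARIMA(p, d, q) and ETS(error, trend, seasonality) on simulated data
set.seed(1)
y <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 200)
fit_arima_example <- arima(y, order = c(1, 1, 1))   # 1 AR term, 1 difference, 1 MA term
fit_ets_example   <- ets(y, model = "AAN")          # Additive error, Additive trend, No seasonality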
We will begin by conducting a stationarity test to check whether the time series has constant statistical properties over time, a necessary condition for many time series models. Then we will use ACF and PACF analysis to help determine the order of the ARIMA model. We will also fit an ETS model to account for any trend and seasonality present in the data.
The Augmented Dickey-Fuller (ADF) test is a popular statistical test used to determine if a time series is stationary or not.
dc1 <- ts(ddf$Dtprice)
adf.test(dc1,alternative = 'stationary')
Augmented Dickey-Fuller Test
data: dc1
Dickey-Fuller = -5.317, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary
The test returns several values, including the Dickey-Fuller test statistic, which is -5.317; the more negative this statistic, the stronger the evidence against the unit-root (non-stationary) null hypothesis.
The lag order of 8 is the number of lagged difference terms included in the test regression; by default, adf.test() sets it to trunc((n - 1)^(1/3)), where n is the length of the series.
The p-value of 0.01 is below the 0.05 significance level, suggesting that the series is stationary.
In order to select the appropriate order of AR and MA models, it is helpful to examine the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the data.
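For instance, the ACF and PACF of the daily product revenue series constructed above can be inspected as follows (the plots themselves are not reproduced here):
# Autocorrelation structure of the product revenue series dc1 (plots not shown)
acf(dc1, main = "ACF of daily product revenue")
pacf(dc1, main = "PACF of daily product revenue")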
When modeling time series data with an ARIMA model, selecting the right parameter values is critical for an accurate and reliable forecast. To help choose them, we can use the auto.arima() function in R, which automatically selects values for p, d, and q based on the AIC (Akaike Information Criterion) and other model selection criteria. The AIC measures the relative quality of a statistical model for a given dataset, with lower values indicating a better fit.
However, auto.arima() may not always find the best parameters for complex or unusual time series. We therefore loop through a set of candidate p, d, and q values and search for the combination that minimizes the AIC; this refines the search beyond what auto.arima() provides on its own.
To make this process more efficient, we use the AIC of the auto.arima() model as the benchmark for the search, which reduces the number of parameter combinations worth exploring. The ACF and PACF plots can also suggest candidate values for the AR and MA components of the ARIMA model.
fit0 <- auto.arima(dc1)
min_aic <- AIC(fit0)
min_order <- arimaorder(fit0)  # start from the auto.arima() order so it is retained if no candidate beats it
# Loop through all combinations of p, d, and q values
for (p in 1:2) {
for (d in 0:1) {
for (q in 1:14) {
# Fit ARIMA model with current order
fit <- arima(dc1, order = c(p, d, q))
# Calculate AIC for current model
aic <- AIC(fit)
# Check if current AIC is lower than previous minimum
if (aic < min_aic) {
# Update minimum AIC and corresponding order
min_aic <- aic
min_order <- c(p, d, q)
}
}
}
}
# Fit final ARIMA model with minimum AIC order
final_fit <- arima(dc1, order = min_order)
AIC(final_fit)
[1] 9905.684
By combining auto.arima() with the grid search above, we strike a balance between computational efficiency and model accuracy and obtain the best model we can for this time series.
The Ljung-Box test is a statistical test used to check whether there is any remaining pattern or structure in the residuals of a time series model, which could indicate that the model is not accurately capturing all the information in the data. This test was performed on the residuals of the ARIMA(2,1,11) model.
checkresiduals(final_fit)
Ljung-Box test
data: Residuals from ARIMA(2,1,11)
Q* = 6.1436, df = 3, p-value = 0.1048
Model df: 13. Total lags used: 16
Since the p-value of 0.1048 is above 0.05, there is no significant autocorrelation left in the residuals, suggesting that the ARIMA(2,1,11) model captures the structure of this series adequately.
fcast <- forecast(final_fit, h = 100)
autoplot(fcast)
de1 <- accuracy(final_fit)
We will now fit an ETS model to our time series data. Once the model is fitted, we will run the checkresiduals() function to assess the quality of the fit; this helps identify any patterns or trends the model may have missed and evaluate whether it is suitable for forecasting.
# Forecasting by ETS method
fit1 <- ets(dc1)
checkresiduals(fit1)
Ljung-Box test
data: Residuals from ETS(M,A,N)
Q* = 37.231, df = 6, p-value = 0.000001587
Model df: 4. Total lags used: 10
The extremely small p-value indicates significant autocorrelation remaining in the residuals, so the ETS(M,A,N) model does not fully capture the structure of this series.
# Forecasting with ETS method
fcast1 = forecast(fit1, h = 100)
plot(fcast1, xlab = "Time", ylab = "Price")
summary(fit1)
ETS(M,A,N)
Call:
ets(y = dc1)
Smoothing parameters:
alpha = 0.1594
beta = 0.0001
Initial states:
l = 330.5175
b = 21.1804
sigma: 0.8502
AIC AICc BIC
12168.45 12168.55 12190.72
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -102.5985 630.5628 420.8989 -170.8606 190.6208 0.9754222
ACF1
Training set 0.2213669
de2 <- accuracy(fit1)
dc2 <- ts(dcdf$DCprice)
adf.test(dc2, alternative = 'stationary')
Augmented Dickey-Fuller Test
data: dc2
Dickey-Fuller = -5.6536, Lag order = 9, p-value = 0.01
alternative hypothesis: stationary
The test returns several values, including the Dickey-Fuller test statistic, which is -5.6536; this strongly negative value provides evidence against the unit-root (non-stationary) null hypothesis.
The lag order of 9 is the number of lagged difference terms included in the test regression.
The p-value of 0.01 is below the 0.05 significance level, suggesting that this series is also stationary.
The Ljung-Box test is a statistical test used to determine whether the residuals (i.e., the differences between the actual values and predicted values) of a time series model are correlated. In this case, the test was conducted on the residuals of an ARIMA(2,0,2) model with a non-zero mean.
fit0 <- auto.arima(dc2)
min_aic <- AIC(fit0)
min_order <- arimaorder(fit0)  # keep the auto.arima() order unless the grid search finds a lower AIC
for (p in 1:2) {
for (d in 0:0) {
for (q in 1:14) {
fit <- arima(dc2, order = c(p, d, q))
aic <- AIC(fit)
if (aic < min_aic) {
min_aic <- aic
min_order <- c(p, d, q)
}
}
}
}
final_fit <- arima(dc2, order = min_order)
AIC(final_fit)
[1] 13625.42
The results of the test show that the value of the test statistic (Q*) is 0.66009 and the p-value is 0.9953. Since the p-value is much greater than the conventional significance level of 0.05, we can conclude that there is no evidence of residual autocorrelation in the model. In other words, the residuals are independent and the model is a good fit for the data.
checkresiduals(final_fit)
Ljung-Box test
data: Residuals from ARIMA(2,0,2) with non-zero mean
Q* = 0.66009, df = 6, p-value = 0.9953
Model df: 4. Total lags used: 10
fcast <- forecast(final_fit, h = 100)
autoplot(fcast)
cu1 <- accuracy(final_fit)
We will fit an ETS model to this time series as well; this will help us identify any patterns or trends the model may have missed and evaluate whether it is suitable for forecasting.
In this case, the test was conducted on the residuals of an ETS(M,Ad,N) model. The test result shows that the p-value is extremely small, which indicates strong evidence against the null hypothesis that the residuals are independently distributed. Therefore, the model may not be a good fit for the data, and there may be patterns or structure in the residuals that the model did not capture.
# Forecasting by ETS method
fit11 <- ets(dc2)
checkresiduals(fit11)
Ljung-Box test
data: Residuals from ETS(M,Ad,N)
Q* = 135, df = 5, p-value < 0.00000000000000022
Model df: 5. Total lags used: 10
# Forecasting with ETS method
fcast1 = forecast(fit11, h = 100)
plot(fcast1, xlab = "Time", ylab = "Price")
summary(fit11)
ETS(M,Ad,N)
Call:
ets(y = dc2)
Smoothing parameters:
alpha = 0.6525
beta = 0.0001
phi = 0.98
Initial states:
l = 12145.253
b = 435.4593
sigma: 0.3307
AIC AICc BIC
16328.47 16328.59 16356.10
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -60.91717 2909.94 1307.506 -66.75619 78.53641 1.385986
ACF1
Training set 0.4860302
cu2 <- accuracy(fit11)
dc3 <- ts(dcf$Dprice)
adf.test(dc3,alternative = 'stationary')
Augmented Dickey-Fuller Test
data: dc3
Dickey-Fuller = -3.5392, Lag order = 9, p-value = 0.03854
alternative hypothesis: stationary
The test statistic of -3.5392 is below the critical value at the 5% significance level, so we can reject the null hypothesis of a unit root (non-stationarity) in the data. The p-value of 0.03854 is less than 0.05, indicating that the result is statistically significant. We can therefore treat the series as stationary on the basis of the ADF test, meaning there is no evidence of a stochastic trend.
Overall, the autocorrelation coefficients suggest that the series has some autocorrelation, with the strength of the correlation decreasing as the lag increases.
In this case, the Ljung-Box test was performed on the residuals of an ARIMA(4,0,20) model with a non-zero mean.
fit0 <- auto.arima(dc3)
min_aic <- AIC(fit0)
min_order <- arimaorder(fit0)  # keep the auto.arima() order unless the grid search finds a lower AIC
for (p in 1:4) {
for (d in 0:0) {
for (q in 1:20) {
fit <- arima(dc3, order = c(p, d, q))
aic <- AIC(fit)
if (aic < min_aic) {
min_aic <- aic
min_order <- c(p, d, q)
}
}
}
}
final_fit <- arima(dc3, order = min_order)
AIC(final_fit)
[1] 16230.18
Since the p-value is less than the significance level of 0.05, we reject the null hypothesis that there is no autocorrelation in the residuals. Therefore, we can conclude that there is evidence of autocorrelation in the residuals, indicating that the ARIMA(4,0,20) model may not be the best fit for the data.
checkresiduals(final_fit)
Ljung-Box test
data: Residuals from ARIMA(4,0,20) with non-zero mean
Q* = 16.564, df = 3, p-value = 0.0008686
Model df: 24. Total lags used: 27
fcast <- forecast(final_fit, h = 100)
autoplot(fcast)
co1 <- accuracy(final_fit)
We will fit an ETS model to this time series as well; this will help us identify any patterns or trends the model may have missed and evaluate whether it is suitable for forecasting.
The test statistic Q* is 99.516, with 8 degrees of freedom, and an extremely small p-value (< 0.00000000000000022). This indicates strong evidence of autocorrelation in the residuals at some lags, meaning that the ETS(M,N,N) model may not be fully capturing the structure of the time series.
# Forecasting by ETS method
fit1 <- ets(dc3)
checkresiduals(fit1)
Ljung-Box test
data: Residuals from ETS(M,N,N)
Q* = 99.516, df = 8, p-value < 0.00000000000000022
Model df: 2. Total lags used: 10
# Forecasting with ETS method
fcast1 = forecast(fit1, h = 100)
plot(fcast1, xlab = "Time", ylab = "Price")
summary(fit1)
ETS(M,N,N)
Call:
ets(y = dc3)
Smoothing parameters:
alpha = 0.1832
Initial states:
l = 41548.196
sigma: 0.5058
AIC AICc BIC
18884.19 18884.23 18898.01
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 299.9736 15269.71 9534.135 -21.61543 42.6256 0.8925546
ACF1
Training set 0.179928
co2 <- accuracy(fit1)
In this section, we will compare the performance of ARIMA and ETS models.
Description <- as.data.frame(rbind(de1, de2))
rownames(Description) <- c("ARIMA", "ETS")
Description
ME RMSE MAE MPE MAPE MASE
ARIMA 1.781906 581.0624 360.0603 -125.0269 152.2949 0.8344304
ETS -102.598469 630.5628 420.8989 -170.8606 190.6208 0.9754222
ACF1
ARIMA 0.002365676
ETS 0.221366926
Based on the table, the ARIMA model performs better than the ETS model on every metric, including ACF1, where its residual autocorrelation is much closer to zero. For forecasting this product's revenue, the ARIMA model is therefore the better choice.
Customer <- as.data.frame(rbind(cu1, cu2))
rownames(Customer) <- c("ARIMA", "ETS")
Customer
ME RMSE MAE MPE MAPE MASE
ARIMA 7.512198 2416.56 972.9974 -62.44116 73.86707 1.031400
ETS -60.917174 2909.94 1307.5060 -66.75619 78.53641 1.385986
ACF1
ARIMA -0.007599638
ETS 0.486030223
Comparing the two models, the ARIMA model outperforms the ETS model on RMSE, MAE, and MASE, indicating a better fit, and its ACF1 is much closer to zero, indicating less correlation among the residuals.
Country <- as.data.frame(rbind(co1, co2))
rownames(Country) <- c("ARIMA", "ETS")
Country
ME RMSE MAE MPE MAPE MASE
ARIMA -13.17208 13621.44 8442.325 -16.65399 35.03267 0.7903429
ETS 299.97364 15269.71 9534.135 -21.61543 42.62560 0.8925546
ACF1
ARIMA 0.005477276
ETS 0.179928024
Based on these metrics, neither model performs particularly well, with large errors and high MAPE. However, the ARIMA model performs slightly better than the ETS model, with lower RMSE, MAE, MAPE, and MASE values and an ACF1 closer to zero.
Firstly, the “REGENCY CAKESTAND 3 TIER” product generates the highest revenue, and analyzing its trends and patterns in pricing over time can help inform pricing strategies and identify market trends. The plot of “TotalPrice” of this product over time shows that there is a significant increase in revenue in late 2010 and early 2011, which could be due to seasonal factors or other factors such as marketing campaigns or changes in consumer preferences.
Secondly, analyzing the spending behavior of the top customer can provide valuable insights, help identify potential areas for growth, and inform marketing and sales strategies. The plot of “TotalPrice” of the top customer over time shows that there are some fluctuations in spending, but overall, there is a steady increase in spending over time.
Thirdly, analyzing the spending behavior of the top country, which is the United Kingdom, can provide insights into the economic conditions and market trends in that country. The plot of the total spending of the United Kingdom over time shows that there are some fluctuations, but overall, there is a steady increase in spending over time, with a significant increase in late 2010 and early 2011, similar to the trend observed for the top product.
Finally, comparing the performance of the ARIMA and ETS models for forecasting the revenue of the top product, top customer, and top country, the ARIMA model performed better in every case, both in forecast accuracy and in residual autocorrelation (ACF1 closer to zero). The final choice of model should nevertheless reflect the specific priorities of the analysis.
In conclusion, the Online Retail II dataset provides valuable insight into the revenue-generating areas of the business. By analyzing trends and patterns in pricing and spending behavior over time, insights can be gained that inform strategic decisions such as pricing strategies and the targeting of specific market segments. Comparing the performance of forecasting models can also help improve the accuracy of future revenue predictions.
Given the superior performance of the ARIMA model in this scenario, it is recommended that organizations and businesses consider using the ARIMA model for their “TotalPrice” forecasting needs. However, other time series models such as SARIMA, SARIMAX, and VAR, may also be worth exploring.
In addition to the initial goal of this research, this dataset and analysis may be utilized to solve a variety of business and organizational challenges and questions, such as:
Customer segmentation: The data can be used to segment customers based on their purchasing behavior and demographics. This will help the company understand which groups of customers are most valuable and how to target them effectively with marketing campaigns (a brief illustrative sketch is given after this list).
Product recommendations: Based on the historical sales data, we can create a recommendation system that suggests products to customers based on their past purchases. This will help increase customer engagement and sales.
Inventory management: The company can use the sales data to forecast future demand for specific products, which will help them optimize their inventory levels and reduce the cost of overstocking or stock shortages.
Regional analysis: The company can use the data to compare sales across different regions and understand the impact of regional factors such as cultural differences, climate, and economic conditions on sales.
Overall, this project can help the organization optimize its operations and increase revenue, ultimately leading to greater success in a competitive business environment.
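Returning to the customer-segmentation idea above: as a purely hypothetical illustration (not part of the original analysis), recency, frequency, and monetary (RFM) features could be derived from this dataset and clustered, for example with k-means. The feature names and the number of clusters below are assumptions.
# Hypothetical sketch only: RFM features per customer, clustered with k-means
snapshot <- max(data$InvoiceDate)                                   # reference date for recency
rfm <- data %>%
  filter(!is.na(`Customer ID`)) %>%
  group_by(`Customer ID`) %>%
  summarise(Recency   = as.numeric(snapshot - max(InvoiceDate)),    # days since last purchase
            Frequency = n_distinct(Invoice),                        # number of distinct invoices
            Monetary  = sum(TotalPrice))                            # total spend
set.seed(123)
segments <- kmeans(scale(rfm[, c("Recency", "Frequency", "Monetary")]), centers = 4)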