jueves, 22 de agosto de 2013

Forecasting with Neural Network (using SAP HANA and R)

Usually...when I use R...I try to use it with SAP HANA as well...and the most simple way to make them work is for sure an ODBC connection....fast and simple...

First, I went to my SAP HANA Studio to get the query...which is basically...get all the seats (First, Business, Economy) from all flights that happened between 2010 and 2012.

Query from HANA Studio
SELECT FLDATE, SEATSOCC, SEATSOCC_B, SEATSOCC_F 
FROM SFLIGHT.SFLIGHT WHERE YEAR(FLDATE) BETWEEN '2010' AND '2012'

The result will look like this...


But...we want to have all seats summed up in one variable and also organized by month/year. And while we can do that using SQL...that will take away the fun from R...

Getting and formatting data
library("RODBC")

format_date<-function(p_date){
  p_date<-as.Date(as.character(p_date),"%Y%m%d")
  p_date<-format(p_date,"%Y%m")
  return(p_date)
}

ch<-odbcConnect("HANA",uid="SYSTEM",pwd="********")
result<-sqlQuery(ch,"SELECT FLDATE, SEATSOCC, SEATSOCC_B, SEATSOCC_F 
                 FROM SFLIGHT.SFLIGHT WHERE YEAR(FLDATE) BETWEEN '2010' AND '2012'")
odbcCloseAll()
dates<-result$FLDATE
dates<-format_date(dates)
result$FLDATE = dates
result_agg<-aggregate(cbind(SEATSOCC,SEATSOCC_B,SEATSOCC_F)~.,data=result,FUN=sum)
result_total<-data.frame(FLDATE=result_agg$FLDATE,SEATS=result_agg$SEATSOCC+
                         result_agg$SEATSOCC_B+result_agg$SEATSOCC_F,stringsAsFactors=FALSE)

First, by using the RODBC library we connect to our SAP HANA Server. Then, we grab the dates and using a custom function, convert them to Month/Year. We do an aggregation to get the sum of all seats and then built a data frame to hold the data.

When we print out the final result...we will realize something that for sure, escaped our eyes before...


For 2010, the months start on April, meaning that from January to March there's no information...and the same happens to 2012 where the information end on April and from June to December there's nothing.

As we want to do a Forecasting using Neural Network...this incomplete data is never going to work...so we need to do something to fix it...

One thing to do at least for 2012, it's the Moving Average...which is basically grab the values from January to April, sum them, divide them by the number of months and then assign this value to May (201205)...then...grab the value from February to May and do the same thing for June...and so on -:)

For 2010 seems look a little bit more complicated...but it's almost the same...I used something that I like to call Backward Average...despite what the real name is -:P Basically, we grab the values from December to April, sum them, divide the value by the number of months and determine the value for March...and so on...

Let's see the code...

Moving and Backward Average
library("RODBC")

format_date<-function(p_date){
  p_date<-as.Date(as.character(p_date),"%Y%m%d")
  p_date<-format(p_date,"%Y%m")
  return(p_date)
}

moving_average<-function(p_values,year_start,month_start,year_end,month_end){
  month<-as.numeric(month_start) - 1
  init_month<-"01"
  if(length(month)==1){
    init_date<-paste(year_start,"0",month,sep='')
  }
  base_date<-paste(year_start,init_month,sep='')
  counter<-as.numeric(month_end) - as.numeric(month_start)
  
  values<-p_values
  
  for(i in 0:counter){
    values<-subset(values, FLDATE <= init_date & FLDATE >= base_date) 
    new_value<-floor(mean(values$SEATS))
    new_values<-data.frame(FLDATE=paste(year_start,month_start,sep=''),SEATS=new_value)
    p_values<-rbind(p_values,new_values)
    month_start<-as.numeric(month_start) + 1
    if(nchar(month_start)==1){
      month_start<-paste("0",month_start,sep='')
    }
    values<-rbind(values,new_values)
    month<-month + 1
    init_month<-as.numeric(init_month) + 1
    if(nchar(month)==1){
      init_date<-paste(year_start,"0",month,sep='')
    }else{
      init_date<-paste(year_start,month,sep='')
    }
    if(nchar(init_month)==1){
      base_date<-paste(year_start,"0",init_month,sep='')  
    }else{
      base_date<-paste(year_start,init_month,sep='')  
    }
  }
  return(p_values)
}

backward_average<-function(p_values,year_start,month_start,year_end,month_end){
  month<-as.numeric(month_start) - 1
  init_month<-"12"
  if(length(month)==1){
    init_date<-paste(year_start,"0",month,sep='')
  }
  base_date<-paste(year_start,init_month,sep='')
  counter<-as.numeric(month_start) - as.numeric(month_end)
  
  values<-p_values
  
  for(i in 0:counter){
    values<-subset(values, FLDATE <= base_date & FLDATE >= init_date) 
    new_value<-floor(mean(values$SEATS))
    new_values<-data.frame(FLDATE=paste(year_start,month_start,sep=''),SEATS=new_value)
    p_values<-rbind(p_values,new_values)
    month_start<-as.numeric(month_start) - 1
    if(nchar(month_start)==1){
      month_start<-paste("0",month_start,sep='')
    }
    values<-rbind(values,new_values)
    month<-month + 1
    init_month<-as.numeric(init_month) - 1
    if(nchar(month)==1){
      init_date<-paste(year_start,"0",month,sep='')
    }else{
      init_date<-paste(year_start,month,sep='')
    }
    if(nchar(init_month)==1){
      base_date<-paste(year_start,"0",init_month,sep='')  
    }else{
      base_date<-paste(year_start,init_month,sep='')  
    }
  }
  return(p_values)
}

ch<-odbcConnect("HANA",uid="SYSTEM",pwd="********")
result<-sqlQuery(ch,"SELECT FLDATE, SEATSOCC, SEATSOCC_B, SEATSOCC_F 
                 FROM SFLIGHT.SFLIGHT WHERE YEAR(FLDATE) BETWEEN '2010' AND '2012'")
odbcCloseAll()
dates<-result$FLDATE
dates<-format_date(dates)
result$FLDATE = dates
result_agg<-aggregate(cbind(SEATSOCC,SEATSOCC_B,SEATSOCC_F)~.,data=result,FUN=sum)
result_total<-data.frame(FLDATE=result_agg$FLDATE,SEATS=result_agg$SEATSOCC+
                         result_agg$SEATSOCC_B+result_agg$SEATSOCC_F,stringsAsFactors=FALSE)
result_total<-moving_average(result_total,"2012","05","2012","12")
result_total<-backward_average(result_total,"2010","03","2010","01")

When we run and execute this...we will see that in fact...we now have all the 3 years completed -;)


Now...we can finally use the Forecasting -;)

Neural_Network_Forecasting.R
library("RODBC")
library("forecast")

format_date<-function(p_date){
  p_date<-as.Date(as.character(p_date),"%Y%m%d")
  p_date<-format(p_date,"%Y%m")
  return(p_date)
}

moving_average<-function(p_values,year_start,month_start,year_end,month_end){
  month<-as.numeric(month_start) - 1
  init_month<-"01"
  if(length(month)==1){
    init_date<-paste(year_start,"0",month,sep='')
  }
  base_date<-paste(year_start,init_month,sep='')
  counter<-as.numeric(month_end) - as.numeric(month_start)
  
  values<-p_values
  
  for(i in 0:counter){
    values<-subset(values, FLDATE <= init_date & FLDATE >= base_date) 
    new_value<-floor(mean(values$SEATS))
    new_values<-data.frame(FLDATE=paste(year_start,month_start,sep=''),SEATS=new_value)
    p_values<-rbind(p_values,new_values)
    month_start<-as.numeric(month_start) + 1
    if(nchar(month_start)==1){
      month_start<-paste("0",month_start,sep='')
    }
    values<-rbind(values,new_values)
    month<-month + 1
    init_month<-as.numeric(init_month) + 1
    if(nchar(month)==1){
      init_date<-paste(year_start,"0",month,sep='')
    }else{
      init_date<-paste(year_start,month,sep='')
    }
    if(nchar(init_month)==1){
      base_date<-paste(year_start,"0",init_month,sep='')  
    }else{
      base_date<-paste(year_start,init_month,sep='')  
    }
  }
  return(p_values)
}

backward_average<-function(p_values,year_start,month_start,year_end,month_end){
  month<-as.numeric(month_start) - 1
  init_month<-"12"
  if(length(month)==1){
    init_date<-paste(year_start,"0",month,sep='')
  }
  base_date<-paste(year_start,init_month,sep='')
  counter<-as.numeric(month_start) - as.numeric(month_end)
  
  values<-p_values
  
  for(i in 0:counter){
    values<-subset(values, FLDATE <= base_date & FLDATE >= init_date) 
    new_value<-floor(mean(values$SEATS))
    new_values<-data.frame(FLDATE=paste(year_start,month_start,sep=''),SEATS=new_value)
    p_values<-rbind(p_values,new_values)
    month_start<-as.numeric(month_start) - 1
    if(nchar(month_start)==1){
      month_start<-paste("0",month_start,sep='')
    }
    values<-rbind(values,new_values)
    month<-month + 1
    init_month<-as.numeric(init_month) - 1
    if(nchar(month)==1){
      init_date<-paste(year_start,"0",month,sep='')
    }else{
      init_date<-paste(year_start,month,sep='')
    }
    if(nchar(init_month)==1){
      base_date<-paste(year_start,"0",init_month,sep='')  
    }else{
      base_date<-paste(year_start,init_month,sep='')  
    }
  }
  return(p_values)
}

ch<-odbcConnect("HANA",uid="SYSTEM",pwd="********")
result<-sqlQuery(ch,"SELECT FLDATE, SEATSOCC, SEATSOCC_B, SEATSOCC_F 
                 FROM SFLIGHT.SFLIGHT WHERE YEAR(FLDATE) BETWEEN '2010' AND '2012'")
odbcCloseAll()
dates<-result$FLDATE
dates<-format_date(dates)
result$FLDATE = dates
result_agg<-aggregate(cbind(SEATSOCC,SEATSOCC_B,SEATSOCC_F)~.,data=result,FUN=sum)
result_total<-data.frame(FLDATE=result_agg$FLDATE,SEATS=result_agg$SEATSOCC+
                         result_agg$SEATSOCC_B+result_agg$SEATSOCC_F,stringsAsFactors=FALSE)
result_total<-moving_average(result_total,"2012","05","2012","12")
result_total<-backward_average(result_total,"2010","03","2010","01")
result_total <- result_total[order(result_total$FLDATE),] 
result_ts<-ts(result_total$SEATS,frequency=12,start=c(2010,1))

fit <- nnetar(result_ts)
fcast <- forecast(fit)
plot(fcast)

Let's take a lot at the generated plot...


As we can see...the prediction for 2013 to 2015 is very low...but that's due to the fact that 2012 was a very low year...

Greetings,

Blag.