Predicting Airbnb Listing Prices in San Francisco

Author: Kaitlin Khong. This is a final project for the Statistical Machine Learning course, UC Santa Barbara, Spring 2022.

Background

Airbnb has evolved from a distinctive business concept into a popular online marketplace, allowing people to stay in unique, personalized accommodations provided by homeowners. The platform offers a compelling alternative to traditional hotels, catering to diverse needs such as weekend getaways, business trips, and vacations. By connecting travelers with local homeowners, Airbnb not only offers a unique travel experience but also presents an opportunity to explore rental prices for countless locations worldwide. This project focuses on rental prices in the vibrant city of San Francisco.

San Francisco is a renowned tourist destination that draws millions of visitors each year. With such a high influx of tourists in the Bay Area, it is no surprise that many individuals choose Airbnbs as their preferred accommodation when exploring the city.

Given the popularity of Airbnb in the city, a predictive model that estimates listing prices is valuable. Such a model would provide insights for both hosts and guests, allowing hosts to make informed pricing decisions and enabling guests to plan their stays in San Francisco effectively.

Dataset Source: Kaggle SF Airbnb

To effectively predict Airbnb listing prices in San Francisco, narrowing the variables down to a more manageable set is essential: the dataset contains 8111 observations and 106 variables. Among these variables, we will focus on those that directly influence price, such as location factors, accommodation details, and host listing details. Looking at the data, certain variables such as “id”, “listing_url”, and “scrape_id” are clearly not relevant predictors of listing price, so we will filter out these unwanted variables to create a more meaningful set.

We will remove location features such as “state”, “zipcode”, and “is_location_exact”, which are better captured by “longitude” and “latitude”. The “weekly_price” and “monthly_price” columns will also be removed, since we focus on single-day listings.
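The column filtering described above can be sketched in Python with pandas. The toy frame below stands in for the raw listings data and uses only a hypothetical subset of the 106 columns; the original project likely performed this step in R.

```python
import pandas as pd

# Toy frame standing in for the raw listings data
# (hypothetical subset of the 106 columns).
listings = pd.DataFrame({
    "id": [1, 2],
    "listing_url": ["u1", "u2"],
    "scrape_id": [10, 11],
    "state": ["CA", "CA"],
    "zipcode": ["94110", "94117"],
    "is_location_exact": [True, False],
    "weekly_price": [700.0, 900.0],
    "monthly_price": [2500.0, 3100.0],
    "latitude": [37.76, 37.77],
    "longitude": [-122.42, -122.44],
    "bedrooms": [1, 2],
    "price": [120.0, 180.0],
})

# Columns judged irrelevant or redundant for predicting nightly price.
drop_cols = ["id", "listing_url", "scrape_id", "state", "zipcode",
             "is_location_exact", "weekly_price", "monthly_price"]
listings = listings.drop(columns=drop_cols)
```

After this step only the geographic coordinates, accommodation details, and the outcome remain.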

Missing Data 

Before we start exploring the variables in our data, let’s visualize the NA distribution. The bar chart shows that review_scores_rating has the most missing data, with a count of 1651. Since review_scores_rating may play a significant role in predicting price, assigning a value of 0 or dropping the NA rows would heavily skew the ratings. Instead, we fill the missing values with the column mean, which happens to be quite high at 95.42152. This preserves the overall rating distribution while keeping all observations available for modeling.
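Mean imputation of this kind can be sketched in Python with pandas; the values below are toy stand-ins, not the actual review scores.

```python
import pandas as pd

# Hypothetical review-score column with missing entries.
scores = pd.Series([100.0, 90.0, None, 96.0, None])

# Fill NAs with the column mean rather than 0, so the rating
# distribution is not dragged down by placeholder values.
scores_filled = scores.fillna(scores.mean())
```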

Exploratory Data Analysis (EDA)

San Francisco spans a compact 7-by-7-mile area divided into 53 distinct neighborhoods. Let’s explore the geographical distribution of these neighborhoods by examining their respective longitude and latitude coordinates.

Analyzing the correlation plot, it is surprising to see moderate correlations between price and the geographical latitude and longitude; however, we will investigate this relationship further through a scatterplot later on. We also observe a moderate positive correlation between bedrooms and price, which aligns with our expectations.

Modeling

Now we will focus on fitting models to our data to find the most influential factors in predicting SF Airbnb listing prices. Our first step is preparing the data: dividing it into training and testing sets, creating the recipe, and generating 10 equal folds for k-fold cross-validation.

We will use a stratified split so that the training and testing sets have a similar distribution of our outcome variable, price. An 80/20 split with stratification balances providing sufficient data for the model to learn against reserving enough data for an unbiased assessment.
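One way to sketch this split-and-fold setup is shown below in Python with scikit-learn (the project itself appears to use an R recipe-based workflow). Since scikit-learn’s stratification expects discrete classes, the continuous price is binned into quartiles first; the data here is synthetic stand-in data, not the actual listings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold

# Synthetic stand-in data for illustration.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=5.0, sigma=0.5, size=500))
X = pd.DataFrame({"bedrooms": rng.integers(0, 5, size=500)})

# Stratify on quartile bins of the continuous outcome so the
# train/test price distributions stay comparable (80/20 split).
price_bins = pd.qcut(price, q=4, labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, stratify=price_bins, random_state=42)

# 10 equal folds for cross-validation on the training set.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
```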

We can conclude that the random forest model outperforms the other four models. With an RMSE of 57.3, the random forest with 9 predictors, 115 trees, and an mtry of 10 had the lowest error of all the models fitted. Therefore, the random forest with those parameters is the best choice for predicting price.

A final RMSE of 56.9 on the testing set is slightly lower than the cross-validation estimate, indicating the model generalizes well beyond the training data while explaining much of the variation in the target variable. The smaller test RMSE reinforces the model’s effectiveness in predicting price.
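The winning configuration could be sketched as follows in Python with scikit-learn, again on synthetic stand-in data. Note that mtry in a random forest corresponds roughly to scikit-learn’s max_features; since the report’s mtry of 10 exceeds the 9 available predictors, it is capped at 9 here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 9 predictors, as in the report's final model.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Tuned values from the report: 115 trees; max_features stands in
# for mtry, capped at the 9 available predictors.
rf = RandomForestRegressor(n_estimators=115, max_features=9,
                           random_state=42)
rf.fit(X_train, y_train)

# Root-mean-squared error on the held-out test set.
rmse = float(np.sqrt(mean_squared_error(y_test, rf.predict(X_test))))
```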

Conclusion

In hindsight, I recognize that there is room to improve the project’s performance. I would consider exploring different models, specifically neural networks, to see whether they could achieve a lower RMSE. Unfortunately, time constraints limited my ability to investigate this possibility thoroughly.

Moving forward, I would dedicate more time and resources to tuning and experimenting with hyper-parameters.