Crop Yield Optimization: Predictive Modeling & Dashboard Insights

  • Writer: Jeffrey Frankenfeld
  • Mar 21
  • 4 min read

Updated: May 1

Background

Understanding what drives agricultural productivity is essential for food security, sustainability, and strategic planning. This project explores a dataset blending real agricultural data and synthetic augmentation - combining regional crop yields, meteorological data, and simulated growing conditions across the U.S. The goal was to identify which environmental and operational factors influence yield, and to visualize how these insights can inform better agricultural practices through predictive modeling and interactive dashboard design.



Goals

  • Analyze agricultural data to identify the key factors influencing crop yield across U.S. regions

  • Develop a hypothesis using exploratory data analysis

  • Apply regression modeling, clustering and decision tree techniques to test the hypothesis

  • Visualize results in an interactive Tableau dashboard

  • Support data-informed decision making for agricultural planning and optimization




Exploratory Data Analysis

I began by exploring the dataset to uncover patterns and relationships that could help explain variations in crop yield.


Identifying Relationships

The correlation heatmap below highlights how the features relate to one another and to yield.


Variable Exploration (heatmap)
  • Rainfall had the strongest correlation with yield (r = 0.87), suggesting it may be the most important factor in determining productivity

  • Fertilizer and irrigation also showed moderate positive relationships

  • Temperature and days to harvest showed weak or no significant correlation with yield
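
A correlation check like the one summarized above could be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, and the column names (`rainfall_mm`, `temperature_c`, `fertilizer_used`, `irrigation_used`, `yield_t_ha`) and effect sizes are assumptions chosen only to make the example runnable.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; names and coefficients are hypothetical.
rng = np.random.default_rng(0)
n = 500
rainfall = rng.uniform(100, 1000, n)      # mm
temperature = rng.uniform(15, 40, n)      # degrees C, unrelated to yield here
fertilizer = rng.integers(0, 2, n)        # 1 = fertilizer used
irrigation = rng.integers(0, 2, n)        # 1 = irrigation used
yield_t = (0.005 * rainfall + 1.5 * fertilizer + 1.2 * irrigation
           + rng.normal(0, 0.5, n))       # tons/hectare

df = pd.DataFrame({
    "rainfall_mm": rainfall,
    "temperature_c": temperature,
    "fertilizer_used": fertilizer,
    "irrigation_used": irrigation,
    "yield_t_ha": yield_t,
})

# Pearson correlation matrix (the pandas default); this is what a heatmap would plot.
corr = df.corr()

# Correlation of each feature with yield, strongest first.
yield_corr = corr["yield_t_ha"].drop("yield_t_ha").sort_values(ascending=False)
print(yield_corr.round(2))
```

Sorting the yield column of the matrix gives the same ranking a reader would take from the heatmap, without having to compare colors by eye.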


From Correlation to Hypothesis Testing

Based on the initial relationships, I developed the following hypothesis to guide the modeling phase:


Null Hypothesis (H₀):

There is no significant relationship between the amount of rainfall a crop receives and its yield.


Alternative Hypothesis (H₁):

There is a significant positive relationship between the amount of rainfall a crop receives and its yield.



Geospatial Analysis

To better understand how crop yield and rainfall vary across regions, I created a set of choropleth maps using U.S. Census regions. This spatial analysis helped validate earlier findings and made regional trends easier to interpret.


Rainfall by Region (choropleth map)

Yield by Region (choropleth map)
  • Regions with higher total rainfall consistently showed higher crop yields

  • The North experienced the highest average rainfall and the greatest yields per hectare

  • Eastern regions, in contrast, had both lower rainfall and lower yields, further reinforcing the strength of the rainfall-yield relationship identified in the analysis


These maps support the hypothesis from a geographic perspective: rainfall not only correlates statistically with yield, but the two also rise and fall together across U.S. regions.



Modeling

To better understand the drivers of crop yield and validate the patterns observed during exploratory analysis, I applied a series of predictive modeling techniques. These included regression, clustering, and decision tree analysis - each helped to quantify relationships, uncover patterns, and identify actionable variables.


Regression Analysis

To quantify the impact of rainfall on crop yield, I built a linear regression model with rainfall as the single predictor.

Yield vs Rainfall - Test Set (scatterplot w/ regression line)
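  • R² Score: 0.58 - Rainfall explains 58% of the variation in yield on the test set

  • Slope: Each additional mm of rainfall is associated with an increase of about 0.005 tons/hectare in yield (roughly 0.5 tons/hectare per 100 mm)

  • P-value: 0 (rounds to zero at the reported precision) - Statistically significant relationship

  • MSE: 1.19 - Mean squared prediction error, in squared tons/hectare; the corresponding RMSE is about 1.09 tons/hectare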
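
For readers who want to reproduce metrics of this kind, here is a hedged sketch of a single-predictor regression evaluated on a held-out test set. It assumes scikit-learn and SciPy; the data is synthetic (a true slope of 0.005 is baked in for illustration), and the split ratio and variable names are assumptions rather than the settings used in this project.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: yield rises 0.005 t/ha per mm of rainfall, plus noise.
rng = np.random.default_rng(42)
rainfall = rng.uniform(100, 1000, 1000)                      # mm
yield_t = 2.0 + 0.005 * rainfall + rng.normal(0, 1.0, 1000)  # tons/hectare

# Hold out 30% of the rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    rainfall.reshape(-1, 1), yield_t, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

slope = model.coef_[0]                    # tons/hectare per mm of rainfall
r2 = r2_score(y_test, y_pred)             # share of test-set variance explained
mse = mean_squared_error(y_test, y_pred)  # in squared tons/hectare
rmse = np.sqrt(mse)                       # back in tons/hectare

# Two-sided p-value for H0 "slope = 0", computed on the training data.
p_value = stats.linregress(X_train.ravel(), y_train).pvalue

print(f"slope={slope:.4f}  R2={r2:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  p={p_value:.3g}")
```

Reporting RMSE alongside MSE is a small design choice: RMSE is in the same units as yield, so it is easier to read as a typical prediction error.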


Clustering Analysis

To explore how environmental conditions group together, I applied K-Means clustering to segment observations by yield, rainfall, and supporting variables. Three distinct clusters emerged, primarily driven by rainfall levels.

Rainfall vs Yield (scatterplot w/ cluster coloring)

High Rainfall Cluster (701 - 999 mm)

  • Average Yield: 6.15 tons/hectare

  • Mean Rainfall: 850 mm

  • Harvest Time: Slightly shorter than other clusters

  • Insight: These regions benefit from abundant water, resulting in faster, more productive growing cycles. This is the highest-yielding cluster.

Moderate Rainfall Cluster (401 - 700 mm)

  • Average Yield: 4.68 tons/hectare

  • Mean Rainfall: 551 mm

  • Harvest Time: 104.5 days

  • Insight: Represents balanced growing conditions that still support strong yields despite less rainfall.

Low Rainfall Cluster (100 - 400 mm)

  • Average Yield: 3.15 tons/hectare

  • Mean Rainfall: 250 mm

  • Harvest Time: Similar to other clusters (~104.6 days)

  • Insight: Lower yields despite similar growing periods suggest rainfall is the limiting factor. Irrigation may be critical in these regions.
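
A three-cluster segmentation like the one described above can be sketched with scikit-learn's `KMeans`. This is an illustration under stated assumptions: the data is synthetic, the column names are hypothetical, and for simplicity the model is fit on rainfall and yield only, with days to harvest summarized afterwards rather than used as a clustering feature.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; names and coefficients are hypothetical.
rng = np.random.default_rng(1)
n = 600
df = pd.DataFrame({
    "rainfall_mm": rng.uniform(100, 1000, n),
    "days_to_harvest": rng.normal(104.5, 10, n),  # independent of yield here
})
df["yield_t_ha"] = 2.0 + 0.005 * df["rainfall_mm"] + rng.normal(0, 0.8, n)

# Standardize so rainfall (hundreds of mm) does not dominate yield (single digits).
features = StandardScaler().fit_transform(df[["rainfall_mm", "yield_t_ha"]])

# Fit K-Means with k=3 and attach the cluster label to each row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(features)

# Per-cluster averages, ordered from driest to wettest cluster.
summary = (
    df.groupby("cluster")[["rainfall_mm", "yield_t_ha", "days_to_harvest"]]
    .mean()
    .sort_values("rainfall_mm")
)
print(summary.round(2))
```

K-Means assigns arbitrary label numbers, so the summary is sorted by mean rainfall before being read as low, moderate, and high.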

Supporting Variable Insights

  • Rainfall vs Yield: Clear stratification across clusters confirms rainfall is the most important driver of yield

  • Temperature vs Yield: No discernible pattern. Yield is evenly spread across temperatures, suggesting temperature is not a major factor in this dataset

  • Days to Harvest vs. Yield: Similar harvest durations across clusters, regardless of yield



Operational Insights: Fertilizer & Irrigation

The decision tree analysis revealed that after rainfall, the two most impactful variables influencing yield were fertilizer use and irrigation. To better understand how these common agricultural practices affect productivity, I analyzed yield totals across all combinations of fertilizer and irrigation usage - segmented by rainfall conditions.

Fertilizer & Irrigation Status (stacked bar chart)
  • Crops that received both fertilizer and irrigation produced the highest total yield across all rainfall environments

  • Using fertilizer only outperformed irrigation-only setups, but both independently increased productivity

  • Crops that received neither consistently produced the lowest yields, especially in low rainfall regions

  • Yield increases by intervention (compared to no intervention):

    • +38% with fertilizer

    • +30% with irrigation

    • +43% with both
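
The two steps in this section, ranking variables with a decision tree and comparing yield across fertilizer and irrigation combinations, could look roughly like the sketch below. Everything here is an assumption for illustration: the data is synthetic, the column names and tree depth are hypothetical, and the uplift is computed on mean yield per group, which may differ from the totals shown in the chart.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; effect sizes are invented for the example.
rng = np.random.default_rng(7)
n = 2000
df = pd.DataFrame({
    "rainfall_mm": rng.uniform(100, 1000, n),
    "temperature_c": rng.uniform(15, 40, n),
    "fertilizer_used": rng.integers(0, 2, n),
    "irrigation_used": rng.integers(0, 2, n),
})
df["yield_t_ha"] = (
    0.005 * df["rainfall_mm"]
    + 1.5 * df["fertilizer_used"]
    + 1.0 * df["irrigation_used"]
    + rng.normal(0, 0.5, n)
)

# Step 1: fit a shallow regression tree and rank features by importance.
features = ["rainfall_mm", "temperature_c", "fertilizer_used", "irrigation_used"]
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(df[features], df["yield_t_ha"])
importances = pd.Series(tree.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)

# Step 2: mean yield per (fertilizer, irrigation) combination,
# expressed as a percentage change versus the no-intervention group.
group_means = df.groupby(["fertilizer_used", "irrigation_used"])["yield_t_ha"].mean()
baseline = group_means.loc[(0, 0)]
uplift_pct = (group_means / baseline - 1) * 100

print(importances.round(3))
print(uplift_pct.round(1))
```

The same `groupby` could be extended with a rainfall band column to reproduce the segmentation by rainfall conditions.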



Key Takeaways

Based on the results of the exploratory analysis and predictive modeling, several clear conclusions emerged:

  • Rainfall is the strongest predictor of crop yield, with a correlation of 0.87 and an R² score of 0.58 in the regression model

  • With a p-value that rounds to 0, the null hypothesis was rejected, confirming a statistically significant positive relationship between rainfall and crop yield

  • Fertilizer use and irrigation were identified as the next most important yield factors - particularly in low and moderate rainfall environments.



Recommendations

  • In regions receiving more than 700 mm of rainfall, expect higher baseline yields - but the addition of fertilizer and irrigation can further enhance results

  • In regions with moderate or low rainfall, agricultural practices like fertilizer and irrigation should be prioritized to mitigate environmental limitations and maintain productivity



Final Thoughts

This project gave me the opportunity to combine exploratory analysis, geospatial visualization, and machine learning techniques to uncover what drives crop yield across different regions and conditions. It was especially valuable to explore how predictive modeling can validate assumptions and surface practical recommendations.


What I Would Improve

  • Strengthen my overall machine learning skills, especially in tuning and validating models

  • Explore multivariate regression with more advanced feature engineering


