Crop Yield Optimization: Predictive Modeling & Dashboard Insights

Jeffrey Frankenfeld
Mar 21

Updated: May 1

Background

Understanding what drives agricultural productivity is essential for food security, sustainability, and strategic planning. This project explores a dataset that blends real agricultural data with synthetic augmentation, combining regional crop yields, meteorological data, and simulated growing conditions across the U.S. The goal was to identify which environmental and operational factors influence yield, and to show how these insights can inform better agricultural practices through predictive modeling and interactive dashboard design.

Goals

Analyze agricultural data to identify the key factors influencing crop yield across U.S. regions
Develop a hypothesis using exploratory data analysis
Apply regression modeling, clustering and decision tree techniques to test the hypothesis
Visualize results in an interactive Tableau dashboard
Support data-informed decision making for agricultural planning and optimization

Exploratory Data Analysis

I began by exploring the dataset to uncover patterns and relationships that could help explain variations in crop yield.

Identifying Relationships

The correlation heatmap below highlights how the features relate to one another:

Rainfall had the strongest correlation with yield (r = 0.87), suggesting it may be the most important factor in determining productivity
Fertilizer and irrigation also showed moderate positive relationships
Temperature and days to harvest showed weak or no significant correlation with yield
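
A ranking like the one above can be read off a Pearson correlation matrix. The following is a minimal sketch only: the data is synthetic, and the column names and coefficients are assumptions made for illustration, not the project's actual dataset or results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data. Column names and effect sizes are illustrative
# assumptions, not values from the project's dataset.
rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "rainfall_mm": rng.uniform(100, 1000, n),
    "temperature_c": rng.uniform(15, 40, n),
    "fertilizer_used": rng.integers(0, 2, n),
    "irrigation_used": rng.integers(0, 2, n),
    "days_to_harvest": rng.integers(60, 150, n),
})
df["yield_t_ha"] = (
    0.005 * df["rainfall_mm"]
    + 1.5 * df["fertilizer_used"]
    + 1.2 * df["irrigation_used"]
    + rng.normal(0, 0.5, n)
)

# Pearson correlation of every feature with yield, strongest first.
yield_corr = (
    df.corr(method="pearson")["yield_t_ha"]
    .drop("yield_t_ha")
    .sort_values(ascending=False)
)
print(yield_corr.round(2))
```

The full matrix returned by `df.corr()` is what a heatmap would typically plot; the single column shown here is the slice that matters for choosing a hypothesis.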

From Correlation to Hypothesis Testing

Based on the initial relationships, I developed the following hypothesis to guide the modeling phase:

Null Hypothesis (H₀):

There is no significant relationship between the amount of rainfall a crop receives and its yield.

Alternative Hypothesis (H₁):

There is a significant positive relationship between the amount of rainfall a crop receives and its yield.
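
One common way to test a hypothesis pair like this is a Pearson correlation test. The sketch below is an illustration under stated assumptions: the data is synthetic, the 0.05 significance level is an assumed convention rather than a value stated in this project, and the one-sided p-value is derived by halving the two-sided one because H₁ is directional.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumed values, not the project's dataset).
rng = np.random.default_rng(0)
rainfall = rng.uniform(100, 1000, 500)
yield_t_ha = 0.005 * rainfall + rng.normal(0, 1.0, 500)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_two_sided = stats.pearsonr(rainfall, yield_t_ha)

# H1 is directional (positive relationship), so convert to a one-sided p-value.
p_one_sided = p_two_sided / 2 if r > 0 else 1 - p_two_sided / 2

alpha = 0.05  # assumed significance level
reject_null = p_one_sided < alpha
print(f"r = {r:.2f}, one-sided p = {p_one_sided:.3g}, reject H0: {reject_null}")
```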

Geospatial Analysis

To better understand how crop yield and rainfall vary across regions, I created a set of choropleth maps using U.S. Census regions. This spatial analysis helped validate earlier findings and made regional trends easier to interpret.

Regions with higher total rainfall consistently showed higher crop yields
The North experienced the highest average rainfall and the greatest yields per hectare
Eastern regions, in contrast, had both lower rainfall and lower yields, further reinforcing the strength of the rainfall-yield relationship identified in the analysis

These maps supported the hypothesis from a geographic perspective: rainfall not only correlates statistically with yield, but the two also vary together spatially across the U.S.

Modeling

To better understand the drivers of crop yield and validate the patterns observed during exploratory analysis, I applied a series of predictive modeling techniques: regression, clustering, and decision tree analysis. Each helped to quantify relationships, uncover patterns, and identify actionable variables.

Regression Analysis

To quantify the impact of rainfall on crop yield, I built a linear regression model with rainfall as the predictor and evaluated it on a held-out test set.

Yield vs Rainfall - Test Set (scatterplot w/ regression line)

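R² Score: 0.58 (rainfall explains 58% of the variation in yield)
Slope: 0.005 (each additional mm of rainfall is associated with a yield increase of 0.005 tons/hectare)
P-value: reported as 0, meaning it is too small to display at the reported precision (a p-value cannot be exactly zero); the relationship is statistically significant
MSE: 1.19 (mean squared error, in squared tons/hectare; its square root, about 1.09 tons/hectare, is the typical prediction error)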
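
To show how these metrics relate to one another, here is a minimal sketch of a single-feature regression scored on a held-out split. The data is synthetic, with a slope and noise level chosen as assumptions so the output lands in the same general range as the figures above; it does not reproduce the project's model or data.

```python
import numpy as np

# Synthetic stand-in data (assumed intercept, slope, and noise level).
rng = np.random.default_rng(1)
n = 2_000
rainfall = rng.uniform(100, 1000, n)
yield_t_ha = 2.0 + 0.005 * rainfall + rng.normal(0, 1.1, n)

# 80/20 train/test split on shuffled indices.
idx = rng.permutation(n)
split = int(0.8 * n)
train, test = idx[:split], idx[split:]

# Ordinary least squares fit of yield on rainfall (degree-1 polynomial).
slope, intercept = np.polyfit(rainfall[train], yield_t_ha[train], deg=1)

# Score on the held-out test set.
pred = slope * rainfall[test] + intercept
resid = yield_t_ha[test] - pred
mse = np.mean(resid ** 2)          # squared tons/hectare
rmse = np.sqrt(mse)                # back in tons/hectare
ss_total = np.sum((yield_t_ha[test] - yield_t_ha[test].mean()) ** 2)
r2 = 1 - np.sum(resid ** 2) / ss_total

print(f"slope = {slope:.4f} t/ha per mm, R² = {r2:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```

Reporting RMSE alongside MSE is a small design choice that keeps the error in the same units as yield, which makes it easier to compare against the cluster averages.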

Clustering Analysis

To explore how environmental conditions group together, I applied K-Means clustering to segment observations by yield, rainfall, and supporting variables. Three distinct clusters emerged, primarily driven by rainfall levels.

Rainfall vs Yield (scatterplot w/ cluster coloring)

High Rainfall Cluster (701 - 999 mm)

Average Yield: 6.15 tons/hectare
Mean Rainfall: 850 mm
Harvest Time: Slightly shorter than other clusters
Insight: These regions benefit from abundant water, resulting in faster, more productive growing cycles. This is the highest-yielding cluster.

Moderate Rainfall Cluster (401 - 700 mm)

Average Yield: 4.68 tons/hectare
Mean Rainfall: 551 mm
Harvest Time: 104.5 days
Insight: Represents balanced growing conditions that still support strong yields despite less rainfall.

Low Rainfall Cluster (100 - 400 mm)

Average Yield: 3.15 tons/hectare
Mean Rainfall: 250 mm
Harvest Time: Similar to other clusters (~104.6 days)
Insight: Lower yields despite similar growing periods suggest rainfall is the limiting factor. Irrigation may be critical in these regions.
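
A segmentation like the three clusters above could be sketched with scikit-learn's K-Means as follows. This is an illustration under assumptions: the data is synthetic, only rainfall and yield are used (the supporting variables are left out for brevity), and the features are standardized first, a preprocessing step this write-up does not confirm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed coefficients, not the project's dataset).
rng = np.random.default_rng(7)
n = 900
rainfall = rng.uniform(100, 1000, n)
yield_t_ha = 2.0 + 0.005 * rainfall + rng.normal(0, 0.8, n)
X = np.column_stack([rainfall, yield_t_ha])

# K-Means is distance-based, so put both features on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled).labels_

# Summarize each cluster, ordered from lowest to highest mean rainfall.
summary = sorted(
    (rainfall[labels == k].mean(), yield_t_ha[labels == k].mean(), int((labels == k).sum()))
    for k in range(3)
)
for mean_rain, mean_yield, size in summary:
    print(f"{size:4d} obs | mean rainfall {mean_rain:6.1f} mm | mean yield {mean_yield:.2f} t/ha")
```

K-Means assigns arbitrary integer labels, so sorting the summary by mean rainfall is what makes the low, moderate, and high groups line up from one run to the next.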

Supporting Variable Insights

Rainfall vs Yield: Clear stratification across clusters confirms rainfall is the most important driver of yield
Temperature vs Yield: No discernible pattern. Yield is evenly spread across temperatures, suggesting temperature is not a major factor in the dataset
Days to Harvest vs. Yield: Similar harvest durations across clusters, regardless of yield

Operational Insights: Fertilizer & Irrigation

The decision tree analysis revealed that after rainfall, the two most impactful variables influencing yield were fertilizer use and irrigation. To better understand how these common agricultural practices affect productivity, I analyzed yield totals across all combinations of fertilizer and irrigation usage, segmented by rainfall conditions.
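
A feature ranking of this kind is often read from a fitted tree's impurity-based importances. The sketch below assumes a scikit-learn regression tree with a depth limit of 4 and uses synthetic data with hypothetical feature names; the tree settings and data behind the ranking above are not specified in this write-up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data. Feature names and effect sizes are assumptions.
rng = np.random.default_rng(3)
n = 2_000
features = ["rainfall_mm", "fertilizer_used", "irrigation_used",
            "temperature_c", "days_to_harvest"]
X = np.column_stack([
    rng.uniform(100, 1000, n),   # rainfall_mm
    rng.integers(0, 2, n),       # fertilizer_used (0/1)
    rng.integers(0, 2, n),       # irrigation_used (0/1)
    rng.uniform(15, 40, n),      # temperature_c (no effect on yield here)
    rng.integers(60, 150, n),    # days_to_harvest (no effect on yield here)
])
y = 0.005 * X[:, 0] + 1.5 * X[:, 1] + 1.2 * X[:, 2] + rng.normal(0, 0.5, n)

# A shallow tree keeps the splits interpretable.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; list them from most to least important.
ranking = sorted(zip(features, tree.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name:16s} {importance:.3f}")
```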

Fertilizer & Irrigation Status (stacked bar chart)

Crops that received both fertilizer and irrigation produced the highest total yield across all rainfall environments
Fertilizer-only setups outperformed irrigation-only setups, but each practice increased productivity on its own
Crops that received neither consistently produced the lowest yields, especially in low rainfall regions
Yield increases by intervention (compared to no intervention):
- +38% with fertilizer
- +30% with irrigation
- +43% with both
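
Percentage increases like these come from comparing each fertilizer/irrigation combination against the no-intervention baseline. The sketch below uses synthetic data with assumed effect sizes, so its percentages will not match the ones above, and it compares mean yield per group rather than totals so that unequal group sizes do not distort the result.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumed baseline and effect sizes).
rng = np.random.default_rng(11)
n = 4_000
df = pd.DataFrame({
    "fertilizer_used": rng.integers(0, 2, n).astype(bool),
    "irrigation_used": rng.integers(0, 2, n).astype(bool),
})
df["yield_t_ha"] = (
    3.0
    + 1.2 * df["fertilizer_used"]
    + 0.9 * df["irrigation_used"]
    + rng.normal(0, 0.5, n)
)

# Mean yield for each of the four fertilizer/irrigation combinations.
mean_yield = df.groupby(["fertilizer_used", "irrigation_used"])["yield_t_ha"].mean()

# Percentage change relative to crops that received neither intervention.
baseline = mean_yield.loc[(False, False)]
uplift_pct = (mean_yield / baseline - 1) * 100
print(uplift_pct.round(1))
```

Adding a rainfall band column to the `groupby` keys would give the same comparison within each rainfall environment.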

Key Takeaways

Based on the results of the exploratory analysis and predictive modeling, several clear conclusions emerged:

Rainfall is the strongest predictor of crop yield, with a correlation of 0.87 and an R² score of 0.58 in the regression model
With a p-value reported as 0 (too small to display at the reported precision), the null hypothesis was rejected, indicating a statistically significant positive relationship between rainfall and crop yield
Fertilizer use and irrigation were identified as the next most important yield factors, particularly in low and moderate rainfall environments

Recommendations

In regions receiving more than 700 mm of rainfall, expect higher baseline yields, though adding fertilizer and irrigation can further improve results
In regions with moderate or low rainfall, agricultural practices like fertilizer and irrigation should be prioritized to mitigate environmental limitations and maintain productivity

Final Thoughts

This project gave me the opportunity to combine exploratory analysis, geospatial visualization, and machine learning techniques to uncover what drives crop yield across different regions and conditions. It was especially valuable to explore how predictive modeling can validate assumptions and surface practical recommendations.

What I Would Improve

Strengthen my overall machine learning skills, especially in tuning and validating models
Explore multivariate regression with more advanced feature engineering