Crop Yield Optimization: Predictive Modeling & Dashboard Insights
- Jeffrey Frankenfeld
- Mar 21
- 4 min read
Updated: May 1
Background
Understanding what drives agricultural productivity is essential for food security, sustainability, and strategic planning. This project explores a dataset blending real agricultural data and synthetic augmentation - combining regional crop yields, meteorological data, and simulated growing conditions across the U.S. The goal was to identify which environmental and operational factors influence yield, and to visualize how these insights can inform better agricultural practices through predictive modeling and interactive dashboard design.
Goals
Analyze agricultural data to identify the key factors influencing crop yield across U.S. regions
Develop a hypothesis using exploratory data analysis
Apply regression modeling, clustering and decision tree techniques to test the hypothesis
Visualize results in an interactive Tableau dashboard
Support data-informed decision making for agricultural planning and optimization

Exploratory Data Analysis
I began by exploring the dataset to uncover patterns and relationships that could help explain variations in crop yield.
Identifying Relationships
The correlation heatmap below highlights how the features relate to one another and to yield:

Rainfall had the strongest correlation with yield (r = 0.87), suggesting it may be the most important factor in determining productivity
Fertilizer and irrigation also showed moderate positive relationships
Temperature and days to harvest showed weak or no significant correlation with yield
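As a rough illustration, the numbers behind a correlation heatmap like this one could be computed as follows. This is a minimal sketch on synthetic data, not the project's code: the column names (`rainfall_mm`, `yield_t_ha`, and so on) and the coefficients used to generate the data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data. Column names and coefficients are assumptions,
# not the project's actual dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "rainfall_mm": rng.uniform(100, 1000, n),
    "temperature_c": rng.uniform(15, 40, n),
    "days_to_harvest": rng.integers(60, 150, n),
    "fertilizer_used": rng.integers(0, 2, n),
    "irrigation_used": rng.integers(0, 2, n),
})
df["yield_t_ha"] = (
    0.005 * df["rainfall_mm"]
    + 1.5 * df["fertilizer_used"]
    + 1.2 * df["irrigation_used"]
    + rng.normal(0, 0.5, n)
)

# Pearson correlation matrix: one value per pair of columns.
corr = df.corr(method="pearson")

# Rank the features by their correlation with yield.
yield_corr = corr["yield_t_ha"].drop("yield_t_ha").sort_values(ascending=False)
print(yield_corr.round(2))
```

The full `corr` matrix is what a heatmap would color; the sorted `yield_corr` column is the quickest way to read off which features move with yield.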
From Correlation to Hypothesis Testing
Based on the initial relationships, I developed the following hypothesis to guide the modeling phase:
Null Hypothesis (H₀):
There is no significant relationship between the amount of rainfall a crop receives and its yield.
Alternative Hypothesis (H₁):
There is a significant positive relationship between the amount of rainfall a crop receives and its yield.
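One way to test a directional hypothesis like this is a one-sided Pearson correlation test. The sketch below assumes SciPy 1.9 or later (for the `alternative` argument) and uses synthetic rainfall and yield values; the 0.05 significance level is also an assumption, since the write-up does not state one.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data; the values are illustrative only.
rng = np.random.default_rng(1)
rainfall_mm = rng.uniform(100, 1000, 300)
yield_t_ha = 2.0 + 0.005 * rainfall_mm + rng.normal(0, 1.0, 300)

# One-sided test matching H1: the correlation is greater than zero.
r, p_value = stats.pearsonr(rainfall_mm, yield_t_ha, alternative="greater")

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(f"r = {r:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```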
Geospatial Analysis
To better understand how crop yield and rainfall vary across regions, I created a set of choropleth maps using U.S. Census regions. This spatial analysis helped validate earlier findings and made regional trends easier to interpret.


Regions with higher total rainfall consistently showed higher crop yields
The North experienced the highest average rainfall and the greatest yields per hectare
Eastern regions, in contrast, had both lower rainfall and lower yields, further reinforcing the strength of the rainfall-yield relationship identified in the analysis
These maps supported the hypothesis from a geographic perspective: rainfall not only correlates statistically with yield, but the two also cluster together spatially across the U.S.
Modeling
To better understand the drivers of crop yield and validate the patterns observed during exploratory analysis, I applied a series of predictive modeling techniques. These included regression, clustering, and decision tree analysis; each helped quantify relationships, uncover patterns, and identify actionable variables.
Regression Analysis
To quantify the impact of rainfall on crop yield, I built a linear regression model with rainfall as the predictor.

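R² Score: 0.58 - Rainfall explains 58% of the variation in yield
Slope: 0.005 - Each additional mm of rainfall is associated with a 0.005 tons/hectare increase in yield
p-value: 0 (rounds to zero at the reported precision) - Statistically significant relationship
MSE: 1.19 - Mean squared prediction error, in (tons/hectare)²; the corresponding RMSE is about 1.09 tons/hectare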
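The metrics above can be reproduced in outline with a single-predictor least-squares fit. The following is a minimal sketch on synthetic data using `scipy.stats.linregress`; the variable names and the data-generating values are assumptions for illustration, and the project's own model may have been built with a different library or evaluated on a held-out split.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data; the values are illustrative only.
rng = np.random.default_rng(2)
rainfall_mm = rng.uniform(100, 1000, 1000)
yield_t_ha = 2.0 + 0.005 * rainfall_mm + rng.normal(0, 1.1, 1000)

# Ordinary least-squares fit of yield on rainfall.
fit = stats.linregress(rainfall_mm, yield_t_ha)
predicted = fit.intercept + fit.slope * rainfall_mm
residuals = yield_t_ha - predicted

# R² two ways: squared correlation, and 1 - SS_res / SS_tot.
r_squared = fit.rvalue ** 2
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((yield_t_ha - yield_t_ha.mean()) ** 2)
r_squared_check = 1 - ss_res / ss_tot

mse = np.mean(residuals ** 2)   # units: (tons/hectare)^2
rmse = np.sqrt(mse)             # units: tons/hectare

print(f"slope = {fit.slope:.4f} t/ha per mm")
print(f"R^2 = {r_squared:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
print(f"p-value to three decimals: {fit.pvalue:.3f}")
```

Two details are worth noting. A very small p-value prints as 0 once it is rounded, even though it is not exactly zero. And for a single-predictor fit scored on the data it was fitted to, R² equals the square of Pearson's r; a score computed on a held-out test split or a different subset can come out lower, which is one possible reason an R² of 0.58 can sit alongside r = 0.87 (0.87² ≈ 0.76).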
Clustering Analysis
To explore how environmental conditions group together, I applied K-Means clustering to segment observations by yield, rainfall, and supporting variables. Three distinct clusters emerged, primarily driven by rainfall levels.

High Rainfall Cluster (701 - 999 mm)
Average Yield: 6.15 tons/hectare
Mean Rainfall: 850 mm
Harvest Time: Slightly shorter than other clusters
Insight: These regions benefit from abundant water, resulting in faster, more productive growing cycles. This is the highest-yielding cluster.
Moderate Rainfall Cluster (401 - 700 mm)
Average Yield: 4.68 tons/hectare
Mean Rainfall: 551 mm
Harvest Time: 104.5 days
Insight: Represents balanced growing conditions that still support strong yields despite less rainfall.
Low Rainfall Cluster (100 - 400 mm)
Average Yield: 3.15 tons/hectare
Mean Rainfall: 250 mm
Harvest Time: Similar to other clusters (~104.6 days)
Insight: Lower yields despite similar growing periods suggest rainfall is the limiting factor. Irrigation may be critical in these regions.
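A segmentation of this kind could be sketched with scikit-learn as shown below. This is an illustration on synthetic data, not the project's pipeline: it clusters on rainfall and yield only (the supporting variables are left out to keep the example short), and the standardization step and the `n_init` and `random_state` settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the values are illustrative only.
rng = np.random.default_rng(3)
n = 600
rainfall_mm = rng.uniform(100, 1000, n)
yield_t_ha = 2.0 + 0.005 * rainfall_mm + rng.normal(0, 0.5, n)

# Standardize so that rainfall (hundreds of mm) does not dominate
# yield (single-digit tons/hectare) in the distance calculation.
X = np.column_stack([rainfall_mm, yield_t_ha])
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Summarize each cluster in original units, ordered by mean rainfall.
summary = sorted(
    (
        float(rainfall_mm[labels == k].mean()),
        float(yield_t_ha[labels == k].mean()),
        int((labels == k).sum()),
    )
    for k in range(3)
)
for mean_rain, mean_yield, size in summary:
    print(f"rainfall {mean_rain:6.0f} mm | yield {mean_yield:.2f} t/ha | n = {size}")
```

K-Means cluster numbers are arbitrary, so sorting the summary by mean rainfall is what lets the groups be read as low, moderate, and high.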
Supporting Variable Insights
Rainfall vs Yield: Clear stratification across clusters confirms rainfall is the most important driver of yield
Temperature vs Yield: No discernible pattern. Yield is evenly spread across temperatures, suggesting temperature is not a major factor in the dataset
Days to Harvest vs. Yield: Similar harvest durations across clusters, regardless of yield
Operational Insights: Fertilizer & Irrigation
The decision tree analysis revealed that after rainfall, the two most impactful variables influencing yield were fertilizer use and irrigation. To better understand how these common agricultural practices affect productivity, I analyzed yield totals across all combinations of fertilizer and irrigation usage - segmented by rainfall conditions.
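For readers curious how a decision tree ranks variables, the sketch below fits a shallow regression tree and reads its impurity-based feature importances. It runs on synthetic data with assumed column names and assumed tree settings (`max_depth`, `min_samples_leaf`); it is not the model used in this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; names and coefficients are assumptions.
rng = np.random.default_rng(4)
n = 2000
feature_names = ["rainfall_mm", "temperature_c", "fertilizer_used", "irrigation_used"]
rainfall = rng.uniform(100, 1000, n)
temperature = rng.uniform(15, 40, n)
fertilizer = rng.integers(0, 2, n)
irrigation = rng.integers(0, 2, n)

X = np.column_stack([rainfall, temperature, fertilizer, irrigation])
y = 0.005 * rainfall + 1.5 * fertilizer + 1.2 * irrigation + rng.normal(0, 0.5, n)

# A shallow tree keeps the splits interpretable and limits overfitting.
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=50, random_state=0)
tree.fit(X, y)

# Impurity-based importances sum to 1; higher means the feature
# accounted for more of the variance reduction across the tree's splits.
importances = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in importances:
    print(f"{name:16s} {score:.3f}")
```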

Crops that received both fertilizer and irrigation produced the highest total yield across all rainfall environments
Fertilizer-only setups outperformed irrigation-only setups, but each practice independently increased productivity
Crops that received neither consistently produced the lowest yields, especially in low rainfall regions
Yield increases by intervention (compared to no intervention):
+38% with fertilizer
+30% with irrigation
+43% with both
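Percentage increases like these come from comparing each fertilizer and irrigation combination against the no-intervention baseline. Here is a minimal pandas sketch on synthetic data; the column names are assumptions, and it compares mean yield per group rather than totals so that unequal group sizes do not distort the comparison.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; names and effect sizes are assumptions.
rng = np.random.default_rng(5)
n = 4000
df = pd.DataFrame({
    "fertilizer_used": rng.integers(0, 2, n),  # 0 = no, 1 = yes
    "irrigation_used": rng.integers(0, 2, n),  # 0 = no, 1 = yes
})
df["yield_t_ha"] = (
    3.5
    + 1.5 * df["fertilizer_used"]
    + 1.2 * df["irrigation_used"]
    + rng.normal(0, 0.5, n)
)

# Mean yield for each of the four practice combinations.
mean_yield = df.groupby(["fertilizer_used", "irrigation_used"])["yield_t_ha"].mean()

# Percentage change relative to crops that received neither practice.
baseline = mean_yield.loc[(0, 0)]
uplift_pct = ((mean_yield / baseline - 1) * 100).round(1)
print(uplift_pct)
```

Adding a rainfall band column to the `groupby` keys would give the same comparison segmented by rainfall conditions.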
Key Takeaways
Based on the results of the exploratory analysis and predictive modeling, several clear conclusions emerged:
Rainfall is the strongest predictor of crop yield, with a correlation of 0.87 and an R² score of 0.58 in the regression model
With a p-value that rounds to 0, the null hypothesis was rejected, indicating a statistically significant positive relationship between rainfall and crop yield
Fertilizer use and irrigation were identified as the next most important yield factors - particularly in low and moderate rainfall environments.
Recommendations
In regions receiving more than 700 mm of rainfall, expect higher baseline yields - but the addition of fertilizer and irrigation can further enhance results
In regions with moderate or low rainfall, agricultural practices like fertilizer and irrigation should be prioritized to mitigate environmental limitations and maintain productivity
Final Thoughts
This project gave me the opportunity to combine exploratory analysis, geospatial visualization, and machine learning techniques to uncover what drives crop yield across different regions and conditions. It was especially valuable to explore how predictive modeling can validate assumptions and surface practical recommendations.
What I Would Improve
Strengthen my overall machine learning skills, especially in tuning and validating models
Explore multivariate regression with more advanced feature engineering

