Blog: The Importance of Good Data in Satellite Imagery Analysis
Introduction
The phrase "Garbage-In-Garbage-Out" is well-known in data science, but it takes on new meaning in real-world projects, especially those involving satellite imagery. In these contexts where ground truth labels are often sparse, preprocessing becomes not just a step, but a cornerstone of success. Reflecting on my project, I have realized that understanding and preparing data account for about 70% (if not more) of the work and determines the quality of the results. Data preprocessing ensures that the inputs to your model are clean, structured, and tailored to the problem at hand. This is true for all kinds of data, whether tabular data, text, or images. However, the proper preprocessing steps come from initially "looking" at the data. That means preprocessing is dependent on the data and the task at hand.
Understanding Satellite Images
Satellite images are far more than just pictures; they encode a wealth of information about the spectral signature of a location. A spectral signature represents the energy reflected from an area, including all the objects within it. For example, forests and urban regions reflect light differently, creating unique spectral signatures that help us distinguish between them. These unique characteristics make satellite imagery a powerful tool for understanding the world from above.
However, this richness of information also makes extraction challenging: the data are multi-channel, noisy, and highly variable. While Convolutional Neural Networks (CNNs) are powerful for tasks involving natural images, they often struggle to generalize on satellite images unless the data is carefully preprocessed and features are thoughtfully engineered.
Lessons from My Work
In my work analyzing parking lot occupancy, I initially turned to CNNs. My goal was to train the models to identify and learn patterns from satellite images of parking lots. To my frustration, the models failed to capture the specific information I wanted, with accuracy barely better than random guessing. This led me to a critical realization: the problem was not with the models themselves but with how I handled the data.
I decided to shift my approach. Instead of relying solely on CNNs, I transformed the data from its array-based image representation into tabular data by extracting meaningful features from the images: the mean, median, standard deviation, and maximum and minimum values of each image channel.
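To make this concrete, here is a minimal sketch of that kind of per-channel feature extraction using NumPy. The function name, the (height, width, channels) shape convention, and the random patch are illustrative assumptions, not the exact code from my project.

```python
import numpy as np

def extract_channel_features(image: np.ndarray) -> dict:
    """Summarize each channel of an image (assumed shape: height x width x channels)
    with simple statistics, turning the raw array into one tabular row."""
    features = {}
    for c in range(image.shape[-1]):
        band = image[..., c]
        features[f"ch{c}_mean"] = band.mean()
        features[f"ch{c}_median"] = np.median(band)
        features[f"ch{c}_std"] = band.std()
        features[f"ch{c}_max"] = band.max()
        features[f"ch{c}_min"] = band.min()
    return features

# A random 4-band patch stands in for a real parking lot image.
patch = np.random.rand(64, 64, 4)
print(extract_channel_features(patch))
```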
These features then served as inputs to conventional machine learning models such as random forests and gradient boosting. This change made the models more robust and allowed them to handle the inherent noise and variability in the data more effectively.
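As a rough sketch of this second step, assuming scikit-learn and a feature matrix assembled from rows like the one above (the placeholder data and the binary occupancy labels are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: each row holds the per-channel statistics of one image,
# each label a hypothetical occupancy class (0 = low, 1 = high).
X = np.random.rand(200, 20)  # e.g., 4 channels x 5 statistics = 20 features
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

for model in (RandomForestClassifier(random_state=42),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: accuracy = {acc:.2f}")
```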
Key Takeaways
Understand your Data: Spend time exploring and analyzing the data before modelling. The better you understand its quirks and characteristics, the more informed your decisions will be.
Preprocessing is Foundational: Clean, structured, and tailored data makes a significant difference. Translating raw data into a format that highlights its most relevant aspects can drastically improve model performance. Converting satellite imagery into tabular features unlocked an understanding I could not get from the raw images alone.
Garbage-In-Garbage-Out: The quality of your input data sets the ceiling for your model's performance. Invest time in cleaning, preprocessing, and engineering features to set your project up for success.
Model Selection is Context-Dependent: Do not default to complex models. Simpler approaches, such as random forests, often outperform deep learning when data are noisy or limited.
Final Thoughts
While model selection is important, models are only as good as the data they are fed. This is particularly true in satellite imagery analysis, where the richness and complexity of the data demand careful preprocessing and thoughtful feature engineering. My journey taught me that even the most sophisticated models cannot compensate for poor data. Spending time preprocessing and understanding the data has improved my results and deepened my understanding of the problem.
Written by Theophilus Aidoo