Object-Based Classification of Sentinel-2 Data Using Free and Open-Source Machine Learning and GIS Tools

2021 ◽  
pp. 771-780
Author(s):  
Harpinder Singh ◽  
Ajay Roy ◽  
Shashikant Patel ◽  
Brijendra Pateriya
2018 ◽  
Vol 10 (9) ◽  
pp. 1419 ◽  
Author(s):  
Mathias Wessel ◽  
Melanie Brandmeier ◽  
Dirk Tiede

We use freely available Sentinel-2 data and forest inventory data to evaluate the potential of different machine-learning approaches to classify tree species in two forest regions in Bavaria, Germany. Atmospheric correction was applied to the Level-1C data, resulting in true surface reflectance, or bottom-of-atmosphere (BOA), output. We developed a semiautomatic workflow for the classification of coniferous (mainly spruce), beech and oak trees by evaluating different classification algorithms (object- and pixel-based) in an architecture optimized for distributed processing. A hierarchical approach was used to evaluate different band combinations and algorithms (Support Vector Machines (SVM) and Random Forest (RF)) for the separation of broad-leaved vs. coniferous trees. The Ebersberger forest was the main project region, and the Freisinger forest was used in a transferability study. Training of the algorithms and accuracy assessment were based on inventory data; validation was conducted using an independent dataset. A confusion matrix, with User's and Producer's Accuracies as well as Overall Accuracies, was created for all analyses. In total, we tested 16 different classification setups for coniferous vs. broad-leaved trees, achieving the best performance of 97% for an object-based multitemporal SVM approach using only band 8 from three scenes (May, August and September). For the separation of beech and oak trees, we evaluated 54 different setups; the best result achieved an accuracy of 91% for an object-based, multitemporal SVM approach using bands 8, 2 and 3 of the May scene for segmentation and all principal components of the August scene for classification. The transferability of the model was tested for the Freisinger forest and showed similar results. This project shows that Sentinel-2 produced only marginally worse results than comparable commercial high-resolution satellite sensors and is well-suited for forest analysis at the tree-stand level.
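As a rough illustration of the kind of algorithm comparison described in this abstract, the sketch below trains an SVM and a Random Forest on per-object band features and reports overall accuracy and a confusion matrix with scikit-learn. The synthetic features (standing in for band 8 from three scenes), the class coding, and all parameter choices are assumptions for demonstration only, not the authors' workflow.

```python
# Minimal sketch (not the authors' pipeline): comparing SVM and Random Forest
# on per-object Sentinel-2 band features for a two-class tree-type separation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
# Placeholder data: mean band-8 reflectance per segment for three acquisition dates (assumed).
X = rng.random((500, 3))
y = rng.integers(0, 2, 500)          # 0 = coniferous, 1 = broad-leaved (assumed coding)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf", C=1.0)),
                  ("RF", RandomForestClassifier(n_estimators=500, random_state=0))]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(name, "overall accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))   # rows: reference, columns: prediction
```

In a real setting the rows of X would come from image segments (object-based) or pixels (pixel-based), and the labels from the inventory data used for training.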


2021 ◽  
Vol 13 (5) ◽  
pp. 937
Author(s):  
Payam Najafi ◽  
Bakhtiar Feizizadeh ◽  
Hossein Navid

Conservation tillage methods that leave crop residue cover (CRC) on the soil surface protect it from water and wind erosion. Hence, the percentage of CRC on the soil surface is critical for the evaluation of tillage intensity. The objective of this study was to develop a new methodology based on semiautomated fuzzy object-based image analysis (fuzzy OBIA) and compare its efficiency with two machine learning algorithms, support vector machine (SVM) and artificial neural network (ANN), for the evaluation of the previous CRC and tillage intensity. We considered spectral images from two remote sensing platforms: an unmanned aerial vehicle (UAV) and the Sentinel-2 satellite. The results indicated that fuzzy OBIA applied to the multispectral Sentinel-2 image, based on a Gaussian membership function, surpassed the machine learning algorithms, with an overall accuracy of 0.920 and Cohen's kappa of 0.874, and yielded useful results for the classification of tillage intensity. The results also indicated that the overall accuracy and Cohen's kappa for the classification of RGB images from the UAV using the fuzzy OBIA method were 0.860 and 0.779, respectively. The semiautomated fuzzy OBIA clearly outperformed the machine learning approaches in estimating CRC and classifying tillage methods, and it has the potential to substitute for or complement field techniques.
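The Gaussian membership function at the heart of the fuzzy OBIA step can be illustrated with a minimal sketch. The class centers, widths, and the residue-sensitive feature below are invented placeholders; the study's actual rule set and software are not reproduced here.

```python
# Illustrative sketch of Gaussian fuzzy membership classification (not the authors' OBIA workflow).
import numpy as np

def gaussian_membership(x, center, sigma):
    """Degree of membership of feature value x in a class with the given center and width."""
    return np.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

# Hypothetical per-object feature (e.g., a residue-sensitive spectral index) and three
# tillage-intensity classes; centers and widths would normally come from training objects.
objects = np.array([0.12, 0.35, 0.61, 0.48, 0.22])
classes = {"intensive": (0.15, 0.08), "reduced": (0.40, 0.08), "no-till": (0.65, 0.08)}

for value in objects:
    memberships = {name: gaussian_membership(value, c, s) for name, (c, s) in classes.items()}
    label = max(memberships, key=memberships.get)   # defuzzify by maximum membership
    print(f"feature={value:.2f} -> {label}", {k: round(v, 2) for k, v in memberships.items()})
```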


2021 ◽  
Vol 13 (9) ◽  
pp. 4728
Author(s):  
Zinhle Mashaba-Munghemezulu ◽  
George Johannes Chirima ◽  
Cilence Munghemezulu

Rural communities rely on smallholder maize farms for subsistence agriculture, the main driver of local economic activity and food security. However, planted area estimates for these farms are unknown in most developing countries. This study explores the use of Sentinel-1 and Sentinel-2 data to map smallholder maize farms. The random forest (RF) and support vector machine (SVM) algorithms, as well as model stacking (ST), were applied. Results show that classifying the combined Sentinel-1 and Sentinel-2 data improved the accuracy of the RF, SVM and ST algorithms by 24.2%, 8.7%, and 9.1%, respectively, compared to classifying the Sentinel-1 data alone. Similarities in the estimated areas (7001.35 ± 1.2 ha for RF, 7926.03 ± 0.7 ha for SVM and 7099.59 ± 0.8 ha for ST) show that machine learning can estimate smallholder maize areas with high accuracy. The study concludes that single-date Sentinel-1 data were insufficient to map smallholder maize farms, whereas single-date Sentinel-1 data combined with Sentinel-2 data were sufficient. These results can be used to support the generation and validation of national crop statistics, thus contributing to food security.
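A hedged sketch of the model stacking (ST) idea follows, with RF and SVM as base learners and a logistic-regression meta-learner via scikit-learn's StackingClassifier. The feature columns standing in for combined Sentinel-1/Sentinel-2 data and all settings are assumptions, not the study's configuration.

```python
# Sketch of stacked classification with RF and SVM base learners (assumed setup, synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((600, 6))     # e.g., VV/VH backscatter plus four Sentinel-2 bands (placeholders)
y = rng.integers(0, 2, 600)  # 1 = maize, 0 = other land cover (assumed coding)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=300, random_state=1)),
                ("svm", SVC(probability=True, random_state=1))],
    final_estimator=LogisticRegression(),
    cv=5,
)
print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```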


2014 ◽  
Vol 6 (6) ◽  
pp. 5019-5041 ◽  
Author(s):  
José Peña ◽  
Pedro Gutiérrez ◽  
César Hervás-Martínez ◽  
Johan Six ◽  
Richard Plant ◽  
...  

2021 ◽  
Author(s):  
Mary B. Makarious ◽  
Hampton L. Leonard ◽  
Dan Vitale ◽  
Hirotaka Iwaki ◽  
Lana Sargent ◽  
...  

SUMMARY

Background: Personalized medicine promises individualized disease prediction and treatment. The convergence of machine learning (ML) and available multi-modal data is key moving forward. We build upon previous work to deliver multi-modal predictions of Parkinson's Disease (PD).

Methods: We performed automated ML on multi-modal data from the Parkinson's Progression Marker Initiative (PPMI). After selecting the best-performing algorithm, all PPMI data were used to tune the selected model. The model was validated in the Parkinson's Disease Biomarker Program (PDBP) dataset. Finally, networks were built to identify gene communities specific to PD.

Findings: Our initial model showed an area under the curve (AUC) of 89.72% for the diagnosis of PD. The tuned model was then tested for validation on external data (PDBP, AUC 85.03%). Optimizing thresholds for classification increased the diagnostic prediction accuracy (balanced accuracy) and other metrics. Combining data modalities outperforms the single-biomarker paradigm. UPSIT was the largest contributing predictor for the classification of PD. The transcriptomic data were used to construct a network of disease-relevant transcripts.

Interpretation: We have built a model using an automated ML pipeline to make improved multi-omic predictions of PD. The model improves disease risk prediction, a critical step for better assessment of PD risk. We constructed gene expression networks for the next generation of genomics-derived interventions. Our automated ML approach allows complex predictive models to be reproducible and accessible to the community.

Funding: National Institute on Aging, National Institute of Neurological Disorders and Stroke, the Michael J. Fox Foundation, and the Global Parkinson's Genetics Program.

RESEARCH IN CONTEXT

Evidence before this study: Prior research into predictors of Parkinson's disease (PD) has either used basic statistical methods to make predictions across data modalities or has focused on a single data type or biomarker model. We have done this using an open-source automated machine learning (ML) framework on extensive multi-modal data, which we believe yields robust and reproducible results. We consider this the first true multi-modality ML study of PD risk classification.

Added value of this study: We used a variety of linear, non-linear, kernel, neural network, and ensemble ML algorithms to generate an accurate classification of both cases and controls in independent datasets, using data that is not involved in PD diagnosis itself at study recruitment. The model built in this paper significantly improves upon the models from our previous work [1] that used the entire training dataset. Building on this earlier work, we showed that the PD diagnosis can be refined using improved algorithmic classification tools that may yield potential biological insights. We have taken careful consideration to develop and validate this model using public controlled-access datasets and an open-source ML framework to allow for reproducible and transparent results.

Implications of all available evidence: Training, validating, and tuning a diagnostic algorithm for PD will allow us to augment clinical diagnoses or risk assessments with less need for complex and expensive exams. Going forward, these models can be built on remote or asynchronously collected data, which may be important in a growing telemedicine paradigm. More refined diagnostics will also increase clinical trial efficiency by potentially refining phenotyping and predicting onset, allowing providers to identify potential cases earlier. Early detection could lead to improved treatment response and higher efficacy. Finally, as part of our workflow, we built new networks representing communities of genes correlated in PD cases in a hypothesis-free manner, showing how new and existing genes may be connected and highlighting therapeutic opportunities.
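The threshold-optimization step mentioned in the Findings can be sketched as follows: given a model's predicted probabilities, scan candidate thresholds and keep the one that maximizes balanced accuracy. The probabilities and labels below are synthetic stand-ins, not PPMI or PDBP data, and the approach is only an illustration of the general idea.

```python
# Sketch of picking a classification threshold that maximizes balanced accuracy (synthetic data).
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 1000)
# Synthetic predicted probabilities that loosely track the labels (overlapping classes).
y_prob = np.clip(0.4 * y_true + 0.6 * rng.random(1000), 0.0, 1.0)

print("AUC:", round(roc_auc_score(y_true, y_prob), 3))

thresholds = np.linspace(0.05, 0.95, 91)
scores = [balanced_accuracy_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"best threshold {best:.2f} -> balanced accuracy {max(scores):.3f}")
```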


2021 ◽  
Vol 24 (44) ◽  
pp. 70-83
Author(s):  
Gonzalo Rodolfo Peña-Zamalloa

The city of Huancayo, like other intermediate cities in Latin America, faces problems of poorly planned land-use change and rapid dynamics in the urban land market. Scarce and outdated information on the urban territory impedes the adequate classification of urban areas, limiting how they can be addressed through planning interventions. The purpose of this research was to adopt unassisted and mixed methods for the spatial classification of urban areas, considering speculative land value, the proportion of urbanized land, and other geospatial variables. Data sources included Multi-Spectral Imagery (MSI) from the Sentinel-2 satellite, the primary road system, and a sample of direct observation points. The processed data were incorporated into georeferenced maps, to which urban limits and official slopes were added. During data processing, the K-Means algorithm was used together with other machine learning and assisted-judgment methods. As a result, an objective classification of urban areas was obtained, which differs from the existing planning.
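A minimal sketch of the unassisted (K-Means) step described above, clustering a table of per-unit geospatial variables with scikit-learn; the variable names, their distributions, and the number of clusters are assumptions rather than the study's inputs.

```python
# Illustrative K-Means clustering of per-block geospatial attributes (synthetic, assumed variables).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Hypothetical per-block attributes: speculative land value, urbanized fraction,
# slope (%), and distance to the primary road system (m).
features = np.column_stack([
    rng.lognormal(5, 1, 400),
    rng.random(400),
    rng.random(400) * 30,
    rng.random(400) * 2000,
])

X = StandardScaler().fit_transform(features)      # put variables on a common scale
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)
print(np.bincount(labels))    # size of each candidate urban-area class
```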


2019 ◽  
Vol 1 ◽  
pp. 1-2
Author(s):  
Jan Wilkening

Abstract. Data is regarded as the oil of the 21st century, and the concept of data science has received increasing attention in recent years. These trends are mainly caused by the rise of big data: data that is big in terms of volume, variety and velocity. Consequently, data scientists are required to make sense of these large datasets. Companies have trouble finding talented people to solve data science problems. This is not surprising, as employers often expect skillsets that can hardly be found in one person: not only does a data scientist need a solid background in machine learning, statistics and various programming languages, but often also in IT systems architecture, databases and complex mathematics. Above all, she should have strong non-technical domain expertise in her field (see Figure 1).

As it is widely accepted that 80% of data has a spatial component, developments in data science could provide exciting new opportunities for GIS and cartography: cartographers are experts in spatial data visualization and are often also very skilled in statistics, data pre-processing and analysis in general. The cartographers' skill levels often depend on the degree to which cartography programs at universities focus on the "front end" (visualisation) of spatial data and leave the "back end" (modelling, gathering, processing, analysis) to GIScientists. In many university curricula, these front-end and back-end distinctions between cartographers and GIScientists are not clearly defined, and the boundaries are somewhat blurred.

In order to become good data scientists, cartographers and GIScientists need to acquire certain additional skills that are often beyond their university curricula. These skills include programming, machine learning and data mining. These are important technologies for extracting knowledge from big spatial datasets, and thereby the logical advancement of "traditional" geoprocessing, which focuses on "traditional" (small, structured, static) datasets such as shapefiles or feature classes.

To bridge the gap between spatial sciences (such as GIS and cartography) and data science, we need an integrated framework of "spatial data science" (Figure 2).

Spatial sciences focus on causality, using theory-based approaches to explain why things are happening in space. In contrast, the scope of data science is to find similar patterns in big datasets with techniques of machine learning and data mining, often without considering spatial concepts (such as topology, spatial indexing, spatial autocorrelation, the modifiable areal unit problem, map projections and coordinate systems, uncertainty in measurement, etc.).

Spatial data science could become the core competency of GIScientists and cartographers who are willing to integrate methods from the data science knowledge stack. Moreover, data scientists could enhance their work by integrating important spatial concepts and tools from GIS and cartography into data science workflows. A non-exhaustive knowledge stack for spatial data scientists, including typical tasks and tools, is given in Table 1.

There are many interesting ongoing projects at the interface of spatial and data science. Examples from the ArcGIS platform include:

- Integration of Python GIS APIs with machine learning libraries, such as scikit-learn or TensorFlow, in Jupyter Notebooks
- Combination of R (advanced statistics and visualization) and GIS (basic geoprocessing, mapping) in ModelBuilder and other automation frameworks
- Enterprise GIS solutions for distributed geoprocessing operations on big, real-time vector and raster datasets
- Dashboards for visualizing real-time sensor data and integrating it with other data sources
- Applications for interactive data exploration
- GIS tools for machine learning tasks such as prediction, clustering and classification of spatial data (see the sketch after this list)
- GIS integration for Hadoop

While the discussion about proprietary (ArcGIS) vs. open-source (QGIS) software is beyond the scope of this article, it has to be stated that (a) many ArcGIS projects are actually open-source and (b) using a complete GIS platform instead of several open-source pieces has several advantages, particularly in efficiency, maintenance and support (see Wilkening et al. (2019) for a more detailed consideration). At any rate, cartography and GIS tools are the essential technology blocks for solving the (80% spatial) data science problems of the future.
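As one concrete example of the first list item (Python GIS APIs combined with machine learning libraries in a notebook), the sketch below clusters vector features with geopandas and scikit-learn; the file path and the attribute column are hypothetical placeholders, and the workflow is only an illustration of the integration pattern, not a specific ArcGIS project.

```python
# Hedged sketch of a "spatial data science" notebook step: geopandas features fed to scikit-learn.
import numpy as np
import geopandas as gpd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

gdf = gpd.read_file("parcels.gpkg")               # hypothetical vector layer
coords = np.column_stack([gdf.geometry.centroid.x, gdf.geometry.centroid.y])
features = np.column_stack([coords, gdf["value"].to_numpy()])  # "value" column is assumed

X = StandardScaler().fit_transform(features)
gdf["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
gdf.to_file("parcels_clustered.gpkg", driver="GPKG")  # write results back to a GIS-readable format
```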


2021 ◽  
Vol 13 (16) ◽  
pp. 3176
Author(s):  
Beata Hejmanowska ◽  
Piotr Kramarczyk ◽  
Ewa Głowienka ◽  
Sławomir Mikrut

The study presents an analysis of the possible use of a limited number of Sentinel-2 and Sentinel-1 images to check whether the crop declarations that EU farmers submit to receive subsidies are true. The declarations used in the research were randomly divided into two independent sets (training and test). Based on the training set, supervised classification of both single images and their combinations was performed using the random forest algorithm in SNAP (ESA) and our own Python scripts. A comparative accuracy analysis was performed on the basis of two forms of confusion matrix (the full confusion matrix commonly used in remote sensing and the binary confusion matrix used in machine learning) and various accuracy metrics (overall accuracy, accuracy, specificity, sensitivity, etc.). The highest overall accuracy (81%) was obtained in the simultaneous classification of multitemporal images (three Sentinel-2 and one Sentinel-1). An unexpectedly high accuracy (79%) was achieved in the classification of a single Sentinel-2 image from the end of May 2018. Noteworthy is the fact that the accuracy of the random forest method trained on the entire training set is 80%, while with the sampling method it is ca. 50%. Based on the analysis of various accuracy metrics, it can be concluded that the metrics used in machine learning, for example specificity and accuracy, are always higher than the overall accuracy. These metrics should be used with caution because, unlike the overall accuracy, they count not only true positives but also true negatives as correct results, giving the impression of higher accuracy. The correct calculation of overall accuracy values is essential for comparative analyses. Reporting the mean accuracy value for the classes as overall accuracy gives a false impression of high accuracy. In our case, the difference was 10–16% for the validation data and 25–45% for the test data.
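The gap between overall accuracy and the binary (one-vs-rest) metrics discussed here can be made concrete with a short sketch: overall accuracy comes from the trace of the full confusion matrix, while per-class binary accuracy and specificity also count true negatives as correct and therefore tend to look higher. The confusion matrix below is invented for illustration, not the study's data.

```python
# Overall accuracy vs. per-class binary metrics derived from one multi-class confusion matrix.
import numpy as np

cm = np.array([[50,  5,  5],
               [10, 30, 10],
               [ 5,  5, 40]])      # rows: reference classes, columns: predicted classes (invented)

overall_accuracy = np.trace(cm) / cm.sum()
print("overall accuracy:", round(overall_accuracy, 3))

for k in range(cm.shape[0]):
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp
    fp = cm[:, k].sum() - tp
    tn = cm.sum() - tp - fn - fp
    accuracy = (tp + tn) / cm.sum()          # binary accuracy counts true negatives as correct
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"class {k}: accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```

With this example the overall accuracy is 0.75, while every per-class binary accuracy exceeds 0.81, which is exactly the inflation effect the abstract warns about.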

