# CAWaR - CAWa project R package

### CAWaR - Supporting Sustainable Water Usage Through Open-source Software

CAWaR1 was built for scientists and public authorities in Central Asia to guide crop type classification and mapping in support of water use efficiency. In the context of climate change, the risk of water scarcity is a lingering issue within arid regions such as Central Asia. Here, agricultural production is highly dependent on irrigation and sensitive in periods of water scarcity, which can potentially lead to food insecurity. As a result, understanding the spatial and temporal distribution of crops is essential offering and understanding of local water requirements and helping build more efficient water management plans.

### Example data

CAWaR is closely related to fieldRS. As a result, it makes use of some of its example data, namely:

• ndvi - RasterStack with 5 time steps of Normalized Difference Vegetation Index (NDVI) images.

• fieldData - SpatialPolygonsDataFrame containing ground truth data on crop types.

• referenceProfiles - data.frame with NDVI profiles for target crop types.

Most data can be accessed with the data() function with the exception of the raster data. This is because raster data is stored in the temporary memory when loaded and it cannot be saved as an R object. Below we can see how to load each dataset into R.

ndvi.ts <- brick(system.file("extdata", "ndvi.tif", package="fieldRS")) # NDVI raster time series
data(fieldData) # ground truth data
data(referenceProfiles) # target crop types NDVI profiles

### Can I use my ground-truth data?

To perform a cropland classification we require samples on different crop types that help us describe their temporal and spatial variability. To do so, we often travel to the field to collect ground-truth data using Geographic Positioning Systems (GPS) and visual interpretation. This data is valuable and provides us with precise information that is useful to train a classifier and validate its results. However, this data often comes with certain inconsistencies. When digitizing ground-truth data, errors such as the misspelling of classes and the overlapping of samples can occur and subsequently hinder future processing steps leading to time-consuming re-runs. To aid in the identification of errors, we developed a series of tools:

• checkSamples() - Checks for the information provided by a shapefile containing ground-truth data and checks for relevant fields and their formats.

• checkLabels() - Checks for misspellings in sample labels and helps correct them in an efficient and timely manner.

• geCheck() - Checks for spatial overlaps between polygons.

Let’s first start with checkSamples(). This function checks if a shapefile of ground-truth data contains relevant information fields. The function searches for the following variables:

• sampler - Name of the responsible for the sample collection.
• date - Date vector with sample acquisition date in yyyy-mm-dd format.
• label - Character vector with class of filed sample.

While knowing the label is a clear requirement, knowing the date and the name of the sampler can also be useful. First, knowing the sampling date can clear doubts when checking samples visually. In areas where intra-annual crop rotation occurs, samples with date information can help us better decipher complex temporal profiles created by multiple growth cycles related to different crops. Second, knowing the name of the sampler can help us clear additional doubts. Often, different actors assume the responsibility for collecting and analyzing the ground-truth data. Consequently, the second may lack knowledge on the sampling sites that can be provided by the sampler. As shown below, calling checkSamples() will return a data.frame reporting on the existence of the nominated fields and the consistency of their format.

sampleTest <- checkSamples(fieldData)
sampler date label
PASSED PASSED ERROR (missing “label” column)

Once we clear this step, we can check for other common issues. First, let us check for geometric errors. For example, if we conduct multiple field campaigns in the same area, we sometimes will find that we sampled a crop field more than once leading to duplicated samples. In order to address these issues, we can use geCheck(). The function will test all the polygons in a shapefile against each other, report on any existing overlaps and display spatial overlaps. Let us test the function.

sampleTest <- geCheck(fieldData)
polygon_1 polygon_2
2 27 1
4 32 2

The output shows us an empty data.frame. This is because fieldData has no geometric errors. However, what if two polygons overlap? To test this we will create a shapefile with two overlapping polygons and test geCheck() again.

# build polygons
p1 <- Polygons(list(Polygon(data.frame(x=c(1, 5, 10, 2, 1), y=c(10, 9, 8, 7, 10)))), ID=1)
p2 <- Polygons(list(Polygon(data.frame(x=c(2, 6, 5, 4, 2), y=c(10, 9, 7, 4, 10)))), ID=2)
p <- SpatialPolygons(list(p1, p2))

# check overlap
sampleTest <- geCheck(p)
polygon_1 polygon_2
2 2 1

The function returns a data.frame and a SpatialPolygons object. The first shows the polygon indices that overlap with each other. The second shows the overlapping area between them - shown in red in the example figure.

After we checked for the content and geometric errors of the ground-truth data, we need to check the labels of the samples as misspellings can occur when digitizing ground-truth data. However, this task can be tiring and can perpetuate mistakes due to human error. When looking at hundreds (or even thousands) of samples, we often lose track of the existing labels and of our own corrections. To anticipate these issues, we can use labelCheck(). Initially, the function will provide the unique labels in a vector of labels. In the case of fieldData, this information is contained in the field crop. Running the function over this data, we obtain a data.frame with an analysis of the distribution of samples per unique label and a plot of the same data made with ggplot. This means that the plot is editable and customizable to e.g. a user’s scientific reports. The function provides the unique labels as a single vector. Looking at the output, we can check if there are any misspellings.

sampleCorrect <- labelCheck(fieldData$crop) sampleCorrect$labels # unique labels
#> [1] "wheat"     "cotton"    "bare land"
count label
wheat 17 wheat
cotton 14 cotton
bare land 4 bare land

In this case, we do not seem to have any of these issues. However, for the purpose of this analysis, let us assume we wish to classify “wheat” and “not-wheat”. This means we need to rename the classes “cotton” and “bare land”. To achieve this, we can now provide labelCheck() with a corrected set of labels that matches the set of original labels in length. We will assign it to a new field in fieldData called crop_2. In this case, the output labels will report a full set of corrected labels instead of the unique occurrences. Additionally, as we saw before, we will obtain a plot with the count of samples per class.

sampleCorrect <- labelCheck(fieldData$crop, sampleCorrect$labels, c("wheat", "not-wheat", "not-wheat"))
fieldData@data$crop_2 <- sampleCorrect$labels
count label
wheat 17 wheat
not-wheat 18 not-wheat

The case we reported assumes that we do not know which classes are expected and that we need to check the output manually. However, when we already know which classes to expect, we can set the auto argument to TRUE. When doing so, the function will use a reference set of labels and compare them against the original ones using the stringdist() function of the R package with the same name. This function computes pairwise string distances between the target and reference labels and labelCheck() returns the label with the smallest distance.

### Extracting temporal profiles

The crop classification algorithm built in the scope of the CAWa project distinguishes crop types based on their unique phenological behaviour. To characterize the growth cycle of a particular crop, polygon-based ground-truth data is quite informative. Since single pixels can be misleading due to e.g. uneven growth patterns within a field, we use all pixels within a polygon and summarize them into a single time-series. However, when dealing with polygons, not all overlapping pixels are useful. Along the borders of a sampled field, we might note - as depicted in the image below - that some pixels are only partially covered and potentially shared by fields of different crops. Therefore, a simple averaging of the pixels might not be sufficient summarize the temporal variable of the crop within the polygon efficiently.

Example figure showing the change in the percent overlap between a polygon and a reference raster grid.

To address this issue, we need to use three functions. First, we are going to use poly2sample() from the fieldRS R package. This function loops through each polygon is a SpatialPolygons object and returns all overlapping pixels as a SpatialPointsDataFrame. Additionally, for each sample, the function will inform on the percent overlap between the corresponding pixel and the reference polygon. Remember to set the preserve.id argument to TRUE to make sure the polygon ID’s are kept and reported.

fieldData2 <- poly2sample(fieldData, ndvi.ts, preserve.id=TRUE)

The SpatialPointsDataFrame object can then be provided to extractTS(). This function will iterate through each polygon in a shapefile, extract the values of series of RasterLayers and derive mean time series for each polygon weighted by the percent cover of each corresponding pixel. If the raster objects are provided as a RasterStack, the function will use the extract() function of the raster package to extract its values. However, if your satellite data has different extents and/or projections (e.g. Landsat) reprojecting and cropping each raster layer can be time-consuming. To avoid this, you can provide a list of pre-read RasterLayer objects instead. In this care, extractTS() will use the extract2() function which iterates through each element in the list and uses the extract() function. Internally, the extract() function reprojects the reference samples to match the RasterLayer when necessary. Once this data is collected, extractTS() will estimate a weighted mean profile for each unique identifier. In this process, border pixels will contribute less for the final time series than the center ones minimizing the influence of mixed-pixels. Let’s apply this function to fieldData2. As environmental predictors, we will use the ndvi.ts RasterBrick provided through fieldRS. The function returns a weighted-mean time-series as well as the original pixel - with coordinates and percent cover values - and a summary of the each, original polygon reporting on the min, max and mean cover and the count of pixels.

fieldDataTS <- extractTS(fieldData2, ndvi.ts, fieldData2$cover, fieldData2$id)
id x y cov
1 706410 4483770 2
1 706440 4483770 19
1 706470 4483770 10
1 706410 4483740 10
1 706440 4483740 100
id x y min.cover max.cover mean.cover
1 706558.0 4483537 2 100 83.35948
2 706676.3 4484546 3 100 82.67692
3 707354.7 4484683 3 100 86.79327
4 707945.6 4484762 2 100 87.55051
5 707999.5 4484168 2 100 90.07563
ndvi.1 ndvi.2 ndvi.3 ndvi.4 ndvi.5
2673.9652 5098.196 2820.041 6307.327 2952.550
759.9144 1311.357 2679.062 7259.830 2376.670
3128.0557 5735.728 2691.121 5771.713 1834.939
2757.6715 4967.065 2752.301 7505.234 1942.889
2058.3882 4617.932 3486.321 6165.167 2117.259

### Building reference profiles: How should each crop type look like?

Before, we learned how to deal with misspellings. Now, we will learn how to deal with mislabeling. This is a frequent issue. When on the field, some crops can look very similar - especially when in early stages of growth - becoming hard to distinguish. Therefore, the sampler can report on the wrong label. When using mislabeled samples for classification, we might face poor validation results and it can be hard - if not impossible - to improve the classification results as long as such mistakes remain uncorrected. Thus, it is important that we check each polygon a priori to assure its veracity. analyseTS()can help us in this task. The function will derive statistics for each unique class in fieldData and use the median of each time-step to build a plot. While we expect some samples are mislabeled, we assume most of them are correct and that the median offers an accurate depiction of a reference profile.

checkTS1 <- analyseTS(as.data.frame(fieldDataTS$weighted.mean), fieldData$crop_2)

Additionally, the function will correlate the profile of every sample with the median profiles of each unique class. This will ease the time requirement of looking at every sample. Assuming that correct samples will be highly correlated with the median profiles, poor correlations will like point to samples for which labels need correction. If this happens, we can again use checkLabels() for the correction.

If the user wishes to check the profiles for every sample, extractTS() can still be useful. Instead of providing classes, the user can provide unique ID’s for each polygon. This way, the function will build a plot for each sample which can be saved separately using ggsave.

checkTS2 <- analyseTS(as.data.frame(fieldDataTS$weighted.mean), as.character(1:length(fieldData))) for (p in 1:length(fieldData)) {ggsave(checkTS2$plots[[p]], paste0("./", checkTS2$labels[p], ".png"), width=10, height=10, units="cm")} Double-checking the quality of the NDVI temporal profiles is indeed necessary. In our experience, apart from the occasional misspelling, we often find mislabeling. Such mistakes hinder the quality of any predictive model and can be difficult to track. However, looking through thousands of plots can be exhausting leading us to commit additional errors. To minimize this workload, we can use compareLabel(). This function Will compare each row in a data.frame (e.g. weighted mean time-series derived with extractTS()) against a set of reference profiles (e.g. median profiles derived with analizeTS()). The function correlates the target and reference profiles and compares the labels for the profiles with the highest correlation. If not the same, the function will return NA. Additionally, the function reports on the distribution on NA values helping the user decide which profiles remain unclear. Below we seen an example. # retrieve reference profiles as a data.frame reference.profiles <- as.data.frame(do.call(rbind, lapply(checkTS1$y.statistics, function(i) {i$median}))) # compare original data and its statistical reference cl <- compareLabel(as.data.frame(fieldDataTS$weighted.mean), reference.profiles, fieldData$crop_2, checkTS1$labels)
x.label y.label pearson compare
wheat wheat 0.95 TRUE
not-wheat not-wheat 1.00 TRUE
wheat wheat 0.92 TRUE
wheat wheat 0.98 TRUE
wheat wheat 0.99 TRUE

The output shows us the original label (x.label) the most likely label based on the reference data (y.label), the pearson correlation between the target and reference (pearson) and a logical variable (compare) which shows if the original and predicted labels match.

### PhenoCropVal: A new concept of land cover validation!

Up to this point we completed the pre-processing of fieldData and derived reference profiles for the classes that should be distinguished. However, before we build a map from this, we might want to check how good we expect the classification to be. This avoids time-consuming raster processing and helps anticipate the need for further pre-processing. phenoCropVal() does just that. In practice, this algorithm will use analyseTS() to build reference profiles. The main difference is in how it separates training and validation data. The function requires a vector showing which samples to group. Then, it keeps each group for validation while the remaining samples are used to build reference profiles for each class using analyseTS(). The function will then call phenoCropClass() to classify the validation profiles and will identify the samples that passed and failed the test. The function repeats this process for each group in each class. For each class, the function will derive an F1-score as an accuracy measure, estimated from the total amount of true and false positives collected at each iteration.

Depiction of the validation process for one class. One group is kept for validation while the remaining ones are used for training.

To assure that neighboring - and therefore spatially auto-correlated - samples are not kept together during validation, we will first use splitSamples(). This function performs a spatial clustering approach labeling samples within predefined spatial distances as part of the same group. The function provides a vector with the region label for each sample and an account of the pixel frequency per sample.

fieldDataCluster <- splitSamples(fieldData, ndvi.ts, fieldData$crop_2, agg.radius=60) id count wheat_001 717 wheat_002 276 wheat_003 191 wheat_004 1116 wheat_005 367 Now we can use phenoCropVal() to check the accuracy of our predicitons. The function will return a data.frame with class wise F1-scores as well and a plot based on it. cropVal <- phenoCropVal(as.data.frame(fieldDataTS$weighted.mean), fieldData$crop_2, fieldDataCluster$region.id)
class count precision recall f1
wheat wheat 17 0.71 0.57 0.63
not-wheat not-wheat 18 0.50 0.64 0.56

Additionally, the function will identify which samples were correctly and incorrectly classified through a logical vector named sample.validation. in combination with fieldData, this result helps us perceive the spatial distribution of errors and, in some cases, might even help us identify poor quality samples that missed the initial checks.

fieldData$validation <- as.factor(cropVal$sample.validation)
spplot(fieldData["validation"])

1. CAWaR was developed at the University of Würzburg, Germany, as part of the project Research Network Water in Central Asia (CAWa) funded by the German Federal Foreign Office. It is part of the German Water Initiative for Central Asia (the so-called “Berlin process”), which the German Federal Foreign Office launched on 1 April 2008 in order to support peaceful, sustainable and mutually beneficial management of transboundary water resources in Central Asia.