Abstract
The Segment Anything Model (SAM) introduces advanced transformer-based capabilities for geological image segmentation. While traditional geoscience applications rely on machine learning models such as random forests and support vector machines applied to tabulated data, SAM’s attention mechanisms enable it to operate directly on image data. This contribution evaluates SAM’s performance in segmenting rock outcrop images into three geological classes, using ground truth masks as references. Segmentation accuracy was assessed via intersection over union (IoU) scores across prompt types, including points and bounding boxes. A combination of bounding box and mask prompts provided the best results, particularly for large, distinct textures. Initial findings indicate SAM’s potential in geological segmentation, though further prompt refinement and expanded datasets are needed to address rock heterogeneity. Future work will focus on fine-tuning SAM for complex textures and integrating laser scan-derived data for quantitative validation. This contribution underscores SAM’s promise in advancing automated geological segmentation applications.
1 Introduction
Rock mass characterization is a necessary step for rock engineering purposes, such as identifying targets for geothermal energy exploitation, slope stability assessment and excavation design. This is conventionally performed in-person using visual techniques and characterization schemes. While machine learning (ML) and deep learning techniques have been used extensively in geological and geotechnical settings, they focus on tabulated, derived data, such as chemical composition or RQD parameters, employing mostly multilayer perceptrons (MLPs), random tree/forest algorithms and support vector machines (SVMs) [1, 2], rather than the raw RGB image data that the geologist uses for on-site rock characterization. The goal is to evaluate a ML model for its ability to “see” and segment a rock face akin to how a geologist would “see” and characterize it. For this reason, a Computer Vision (CV) tool is required. CV-based approaches were, until recently, limited to Convolutional Neural Networks (CNNs) [3, 4]; however, the Transformer architecture, originally developed for large language models (LLMs), has now been extended to CV as well. One of the most promising recent developments in this field was the introduction of the Segment Anything Model (SAM) by Meta AI Research in April 2023. Segmentation in SAM refers to dividing an image into distinct regions or objects based on user prompts, where each pixel is classified with a probability of belonging to the user-defined object of interest. SAM can produce a heat map of probabilities for the queried object across the input image, in addition to segmenting and delineating the boundaries of that object based on the prompts provided [5]. This segmentation can be leveraged for geological interpretative tasks.
Transformers, such as those used in the Segment Anything Model (SAM), offer advantages over Convolutional Neural Networks (CNNs) due to their attention mechanisms, which allow the model to capture relationships across entire images. By dividing an image into patches represented as high-dimensional tokens with added positional encodings, SAM can understand spatial relationships and model long-range dependencies within an image. This global perspective enables SAM to detect complex visual patterns more effectively than CNNs, which are limited by their local filters and fixed receptive fields. CNNs often require deep architectures to approximate global relationships, leading to potential information loss in pooling layers.
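To illustrate the tokenization step described above, the following minimal NumPy sketch splits an image into flattened patch tokens paired with position indices. The 16-pixel patch size and the raw index positions are simplifying assumptions for illustration; in SAM’s ViT_h encoder the patch embedding is a learned projection and the positional encodings are trained vectors.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an RGB image (H, W, 3) into flattened patch tokens,
    ViT-style, and pair them with their position indices."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    tokens = (image[:rows * patch, :cols * patch]
              .reshape(rows, patch, cols, patch, c)
              .swapaxes(1, 2)                 # (rows, cols, patch, patch, c)
              .reshape(rows * cols, patch * patch * c))
    positions = np.arange(rows * cols)        # stand-in for positional encodings
    return tokens, positions

# a 400 x 400 RGB tile yields 25 x 25 = 625 tokens of length 16*16*3 = 768
tokens, positions = patchify(np.zeros((400, 400, 3), dtype=np.uint8))
```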
However, transformers also have drawbacks. Their attention mechanisms and token-based structure demand substantial computational resources, making them more challenging to train and deploy than CNNs. Additionally, transformers require extensive datasets to generalize effectively, which can be a limitation in data-scarce settings.
In segmentation, user input is required to detect the queried object. Thus, the preliminary goal is to use SAM as an interpretative tool for the segmentation of rock faces, not to eliminate the need for a geologist.
2 Used Data and Methodology
Since SAM bases its segmentation result on three features, the RGB channels of a 2D picture, the input data are 60 pictures of a rock face, 400 by 400 pixels each, sliced from a 4000 by 2250 pixel picture taken by a drone [6] for the original purpose of photogrammetry. The pictures for this project were taken of the rock slopes of the Erzberg, in the immediate surroundings of the ZaB—Zentrum am Berg research facility associated with the Montanuniversität Leoben.
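As a minimal sketch of this slicing step (our own helper, not code from the study): the version below cuts non-overlapping tiles and simply drops edge remainders, which for a 4000 by 2250 pixel image yields 50 full tiles; how the study reached 60 tiles (e.g. via overlap or padding) is not specified.

```python
import numpy as np

def slice_tiles(image, tile=400):
    """Slice a large drone image into square, non-overlapping tiles.
    Edge regions that do not fill a complete tile are dropped here."""
    h, w = image.shape[:2]
    return [image[y:y + tile, x:x + tile]
            for y in range(0, h - tile + 1, tile)
            for x in range(0, w - tile + 1, tile)]

# e.g. a 2250 x 4000 px image -> 5 x 10 = 50 full 400 x 400 px tiles
tiles = slice_tiles(np.zeros((2250, 4000, 3), dtype=np.uint8))
```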
As the goal is to segment various rock facies types using as few input prompts as possible, the following workflow steps have been taken, divided into three broad categories: Interpretation → Segmentation → Evaluation (Fig. 1). The clockwise rotating circle arrows signify the first iteration of the process, where a picture is interpreted and assigned prompts for SAM. After this first part of the workflow, the results of the evaluation are propagated back through the segmentation and potentially the interpretation stages, until a prompt creation and interpretation strategy is found that satisfies both the geological context and the requirements of SAM for the creation of a high-quality mask. This “backpropagation” is visualized by the counterclockwise outer arrows.
1. Interpretation
The image has to be segmented by the observer according to reproducible criteria. These criteria have their basis in visual differences between parts of the rock face, as the SAM algorithm uses RGB features as input (Fig. 2). For this reason, a GSI-derived classification method has been chosen to best reflect the geological reality and the needs of the SAM algorithm.
Class 1: Rock mass with few, unoriented and widely spaced joints
Class 2: Rock mass with oriented joint set(s), resulting in a general orientation of the resulting blocks. Joint spacing varies from generally slightly narrower than in class 1 to tightly spaced.
Class 3: Completely disturbed rock mass, where no unifying joint and block orientation is evident. Any bigger blocks that are present are classed as part of class 3 as long as they have no contact with more competent rock.
Fig. 2
Workflow diagram. 1 Interpretation result, 2 Input prompts based on the interpretation result, 3 Combined prediction results, visualized
Based on these criteria, the image is segmented into sections identifiable as one of the classes. These sections are called “ground truth masks” and serve as the reference point for evaluating the performance of the subsequent segmentation.
2. Segmentation
The ground truth masks serve as the basis for the input prompt creation. For this purpose, an algorithm has been developed that automatically generates sparse (point), dense (mask) and bounding box prompts based on the polygons of the ground truth mask. The ground truth mask can be sliced into multiple smaller segments to generate input prompts covering a smaller area. For the resulting mask generation, the ViT_h version of SAM is used, the largest available model, offering the highest accuracy and precision. This version has drawbacks, namely a larger allocation of computing resources and a higher number of training samples required during fine-tuning to avoid overfitting. The output is set to multimask, resulting in three masks per input, each ranked by its IoU score, a metric that evaluates the quality of the prediction. The SAM-specific IoU is a predicted parameter and is unrelated to the calculated IoU score used in the evaluation.
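The study’s prompt-generation code is not published; the sketch below shows one plausible implementation that derives the three prompt types from a ground truth polygon and passes them to SAM’s ViT_h checkpoint through the official segment_anything API. Function and variable names are our own, and the scaling of the resized binary mask into the 256 x 256 logit map expected by mask_input is a common approximation rather than the authors’ method.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def prompts_from_polygon(polygon, image_shape, n_pos=3, n_neg=3, seed=0):
    """Derive sparse (point), dense (mask) and bounding box prompts
    from a ground truth polygon given as (x, y) vertices."""
    rng = np.random.default_rng(seed)
    h, w = image_shape[:2]
    gt = np.zeros((h, w), np.uint8)
    cv2.fillPoly(gt, [np.asarray(polygon, np.int32)], 1)

    ys, xs = np.nonzero(gt)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])  # XYXY format

    pos = rng.choice(len(xs), min(n_pos, len(xs)), replace=False)
    oy, ox = np.nonzero(gt == 0)
    neg = rng.choice(len(ox), min(n_neg, len(ox)), replace=False)
    points = np.concatenate([np.stack([xs[pos], ys[pos]], axis=1),   # positive
                             np.stack([ox[neg], oy[neg]], axis=1)])  # negative
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

    # SAM's dense prompt is a 1 x 256 x 256 logit map; scaling a resized
    # binary mask into a rough logit range is an approximation
    dense = cv2.resize(gt.astype(np.float32), (256, 256)) * 20.0 - 10.0
    return points, labels, box, dense[None], gt

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(tile)  # tile: a 400 x 400 RGB uint8 image (assumed given)
points, labels, box, dense, gt = prompts_from_polygon(polygon, tile.shape)
masks, scores, _ = predictor.predict(          # multimask output: three
    point_coords=points, point_labels=labels,  # candidate masks, each with
    box=box, mask_input=dense,                 # a predicted IoU score
    multimask_output=True)
```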
3. Evaluation
The resulting segmentation masks are visualized and evaluated based on the Intersection over Union (IoU) score. Different prompting options are evaluated and the best one is picked. Should the result be unsatisfactory, a change in interpretation and/or prompting is attempted. It is important to note that the evaluation compares each predicted mask with the ground truth mask associated with its input area, not the combined predicted masks with the interpretation ground truth mask.
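The calculated IoU used for this evaluation is the standard ratio of overlap to union between a predicted mask and its ground truth mask; a minimal implementation might look as follows (our own helper, defining the IoU of two empty masks as zero).

```python
import numpy as np

def calculated_iou(pred, truth):
    """Intersection over Union of two boolean masks of equal shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0  # both masks empty
    return float(np.logical_and(pred, truth).sum() / union)
```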
3 Results and Discussion
Based on the established methodology, an image is interpreted, creating the ground truth mask. Since the basis for interpretation is rock masses, i.e. amalgamations of several types of rock textures rather than singular textures, the question of the scale at which a texture should be assigned a rock class was addressed by choosing pictures of 400 by 400 pixels. Any texture area that is bigger than 50% of the image and/or interconnected across multiple subimages to reach a similar size can be interpreted as belonging to a rock class, whereas smaller and unconnected features are classified as part of the greater rock mass surrounding them. The specific image size was chosen as an acceptable tradeoff between detail and resolution.
As the polygons used for segmentation also represent the basis for the prompt calls, two different masking instances were created, varying in mask size, with instance 1 incorporating nine masks and instance 2 fifteen. At the prediction stage, different prompt combinations were attempted. As can be seen in Fig. 3, particularly image H, the overall point size is heavily dependent on the size of the queried area.
Fig. 3
All possible used prompt types for iteration 1. Green crosses denote positive labels for a mask, red crosses denote negative labels. The blue transparent shape represents the dense mask focus, the black rectangle represents the bounding box
Point prompts focus on the small, pixel-adjacent area designated by the point location, and are thus focused not so much on a specific area as on a specific point. This allows for more “flexible” prediction mask results, as opposed to the concentrated bounding box and dense mask prompts, which direct the SAM algorithm’s attention not merely to a set of individual pixels, but to a connected set of pixels in the area designated by those prompts.
Point prompts have been found to result in acceptably good predictions in cases of big (relative to picture size), relatively homogeneous and/or visually distinct textures, and/or where tight clustering of the negative labels prevents the predicted mask from expanding beyond the designated space. Alternatively, very small areas have the potential to be represented more completely even when using only points (Fig. 4h).
Fig. 4
Instance 1, point inputs, top IoU: Notice the good performance of mask a and the worse performance of smaller masks. Red areas denote rock class 3, green class 2, blue class 1. Good approximations to the ground truth masks are seen in a and h, while the rest show poor association
Dense masks in conjunction with points have led to suboptimal results, with the mask input maximizing the area of coverage, striving to maximize the intersection part of the IoU metric. This leads to a dramatic inflation of the predicted mask size relative to the ground truth mask. It was found, however, that when picking predictions with a lower predicted IoU score, the influence of the dense mask recenters the disparate point prompt prediction onto the ground truth area of interest (Fig. 5).
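In practice this means not automatically keeping the candidate that SAM itself ranks highest. A short sketch of that selection, reusing the multimask outputs from the prediction sketch above (variable names are our own):

```python
import numpy as np

# masks: (3, H, W) candidate masks, scores: (3,) SAM-predicted IoU values
order = np.argsort(scores)      # ascending by SAM's own quality estimate
top_pick = masks[order[-1]]     # candidate with the top predicted IoU
lower_pick = masks[order[-2]]   # lower-ranked candidate that, combined with
                                # a dense-mask prompt, recenters onto the
                                # ground truth area of interest (cf. Fig. 5)
```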
Fig. 5
Instance 1, mask, points, lower IoU: Much better performance across all metrics, still with some mask overlap but generally very good intersection with ground truth mask
The best results are obtained by using a combination of mask and bounding box prompts, which helps to focus the system’s attention on a connected set of pixels instead of the disparate set of the point prompt case. Provided the polygons used for prompt creation are suitably shaped to be encompassed in their entirety by the bounding box, without too much of the enclosed area being taken up by a different rock class, it was possible to concentrate SAM’s attention on the rock type at hand. In many cases this means creating smaller, focused prompts to limit predicted mask generation beyond the ground truth mask (Fig. 6).
Fig. 6
Instance 2, mask, bounding box (bbox), top IoU: Combining dense mask and bounding box input in tandem with smaller focus areas leads to better results
Table 1 Normalized sum of all predicted mask IoU results. Masks and points coupled with a lower IoU score led to similar results as mask and bounding boxes using the top native IoU score

Pred. Masks | Points (Instance 1, top IoU) | Mask, points (Instance 1, lower IoU) | Bbox, mask (Instance 2, top IoU)
A | 0.837931938 | 0.863994899 | 0.863994899
B | 0.104166876 | 0.938427582 | 0.938427582
C | 0.229329808 | 0.901047071 | 0.901047071
D | 0.14328019 | 0.834392015 | 0.834392015
E | 0.159525346 | 0.027214562 | 0.89650471
F | 0.10716832 | 0.809286899 | 0.809286899
G | 0.389433894 | 0.868844406 | 0.817624521
H | 0.45659164 | 0.7960199 | 0.666030534
I | 0.313600238 | 0.724316873 | 0.798584071
J | – | – | 0.819097058
K | – | – | 0.736296171
L | – | – | 0.576058547
M | – | – | 0.6562249
N | – | – | 0.337074303
O | – | – | 0.774137931
Norm. sum | 0.182735217 | 0.751504912 | 0.761652081
While small target areas offer better results, such an approach is not sensible at a practical scale, where segmentation with less input over a larger area should also result in a correct segmentation. To this end, further work will be done, specifically in the area of fine-tuning segment shapes to respond better to more general types of prompts.
4 Outlook
In this training approach, the model would initially receive strong prompt guidance (e.g. points, boxes, dense prompts) to emphasize texture over shape, particularly for classes 1 and 2. As training progresses, prompts would be gradually reduced to encourage reliance on learned features, enhancing generalization. Additionally, subsegmenting larger regions into texture-focused areas is planned to expand the training data and improve texture differentiation, with selective subsegmentation for elongated shapes in class 3. To support texture-based segmentation without changing SAM’s RGB input, laser scan-derived normal vectors from rock surface meshes may be color-coded by vector orientation, providing a quantitative validation framework for SAM’s segmentation.
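As a sketch of how such a color-coding could work, the helper below maps unit normal components linearly to RGB channels; the linear mapping is our assumption, as the exact scheme is left open here.

```python
import numpy as np

def normals_to_rgb(normals):
    """Map unit normal vectors (N, 3), components in [-1, 1],
    linearly to RGB values in [0, 255]."""
    n = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    return ((n + 1.0) * 0.5 * 255.0).astype(np.uint8)

# e.g. a normal pointing along +y maps to (127, 255, 127)
rgb = normals_to_rgb(np.array([[0.0, 1.0, 0.0]]))
```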
The goal of segmenting images into three distinct classes presents challenges, particularly in dataset quality, training strategy selection, and integrating SAM into a scalable, user-friendly workflow. This approach shows promise for partially automating geological segmentation, though expanding the training dataset to address rock heterogeneity will be a primary focus of the first author’s MSc thesis.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.