Open Access 2025 | Original Paper | Book Chapter

3-2-3 Multi-AI Segmentation Framework: LoD-Based, Incremental Segmentation of 3D Scan Data Using Any 2D AI

Authors: Hermenegildo Solheiro, Lee Kent, Keisuke Toyoda

Published in: Virtual Reality and Mixed Reality

Publisher: Springer Nature Switzerland


Abstract

In the age of spatial computing, computer vision is central, and efficient segmentation of 3D scan data becomes a fundamental task. Existing segmentation methods are often locked to specific AI models, lack level-of-detail (LoD) capabilities, and do not support efficient incremental segmentation. These limitations hinder their application to XR systems that integrate architectural and urban scales, which demand up-to-date segmentation information both at scale and in detail while leveraging limited local hardware in distributed computing environments.
In this work, we present a novel framework that integrates multiple 2D AI through AI-agnostic 3D geometry feature fusion, ensuring spatial consistency while taking advantage of the rapid advancements in 2D AI models. Our framework performs LoD segmentation, enabling swift segmentation of downsampled geometry and full detail on needed segments. Additionally, it progressively builds a segmentation database, processing only newly added data, thereby avoiding point cloud reprocessing, a common limitation in previous methods.
In our use case, our framework analyzed a public building based on three scans: a drone LiDAR capture of the exterior, a static LiDAR capture of a room, and a user-held RGB-D camera capture of a section of the room. Our approach provided a fast understanding of building volumes, room elements, and a fully detailed geometry of a requested object, a “panel with good lighting and a view to a nearby building”, to implement an XR activity.
Our preliminary results are promising for applications in other urban and architectural contexts and point to further developments in our Geometric Data Inference AI as a cornerstone for deeper, more accurate Multi-AI integration.

1 Introduction

In the age of spatial computing, XR has become a foundational pillar. XR works on the premise of capturing and referencing reality to extend it. This relies on computer vision, particularly segmentation, which attributes meaning to geometric data, enabling computers to understand and leverage physical space and objects.
At both room and city scales, efficient understanding of spaces, objects, and devices is essential to reconcile physical use with content augmentation [13]. However, existing methods for segmenting point cloud data, whether purely geometric or AI-based, have many limitations. The most capable AI methods are often (i) locked to specific AI models, (ii) lack level-of-detail (LoD) capabilities, and (iii) lack efficient incremental segmentation. This hinders their applicability to XR systems that integrate architectural and urban scales and need up-to-date, meaningful geometric data both at scale and in detail, while (iv) leveraging limited local hardware in distributed computing environments.
In such systems, segmentation technology needs to discern essential from unnecessary information, prioritizing processing needs to enable real-time updates. Moreover, the segmentation framework needs to be resilient, to leverage point clouds from different devices, and to handle common issues such as heavily cluttered views and point sparsity, while taking advantage of local processing power. As such, it needs to be versatile enough to adapt the segmentation model to the available scan data and hardware, focusing on what is necessary for a specific application.
In this work, we propose a segmentation framework that addresses these challenges by (i) integrating different 2D AI models through AI-agnostic 3D geometry feature fusion, ensuring spatial consistency while leveraging the rapid advancements in 2D AI models. With 3D geometry feature fusion, we put geometric information at the core of our framework: if two segments share the same classification and spatial position at the same time, they are treated as the same object.
To avoid excessive computational load, geometry is simplified as much as possible for each operation, using, for example, downsampled clouds or segment convex hulls (c-hulls), to ensure speed without compromising accuracy. This geometric simplification underpins our (ii) LoD segmentation, which starts with very fast segmentation of simplified geometry and adds detail only to the segments that need it.
Additionally, 3D geometry feature fusion enables processing only newly added point data, matching spatial positions against the simplified geometry of previous segments, thus avoiding the need to reprocess point cloud data, which is key for efficient, up-to-date, (iii) incremental segmentation.
Prioritizing segmentation tasks through LoD and incremental segmentation makes it possible to (iv) leverage the local hardware of scanning devices, such as smartphones, tablets, or headsets, eventually allowing lightweight processing to be offloaded to external hardware while saving back-end processing for the heavier tasks that require it.
This paper presents a proof of concept and an initial implementation of the 3-2-3 Multi-AI Segmentation Framework, which addresses challenges (i)-(iv) by tackling the limitations of current research identified in the next section.
In this paper, we review the literature (Sect. 2), explain our method (Sect. 3), document our use case (Sect. 4), discuss the results and future research (Sect. 5), and present our conclusions (Sect. 6).

2 Literature Review

Existing point cloud segmentation methods can be grouped into 3D-based and 2D-based methods.

2.1 3D-Based Methods

  • Geometry-based methods group points into clusters based on geometric relationships and properties, such as distances, normal vectors, or color. They are effective for simple shapes and unobstructed views, such as empty rooms and clear building views [18, 19]. However, their effectiveness diminishes in the cluttered scenes common in real-world settings, making them better suited for post-processing and refining segments [11].
  • AI-based methods often use neural networks to group points based on patterns learned from their training data. They can be voxel-based, organizing points into a grid for voxel analysis, or point-based, using point data directly [4, 15, 16]. In particular, 3D point-based frameworks with Multi-Layer Perceptrons [8] perform well even for heavily cluttered scenes. However, the high point cloud resolution required leads to a heavy computational load [4], and the sparsity of points prevalent in existing scan data can severely degrade performance [2]. These methods are especially suitable for refining segments [17] or as a back-end method, where previous 2D segmentation data can label points to improve efficiency [2, 17].

2.2 2D-Based Multiview Methods

  • Multiview-based methods extract 2D views from 3D point clouds and use 2D segmentation models, typically Convolutional Neural Networks (CNNs), to identify and match segments. These methods benefit from the speed and quality of 2D image segmentation, but face several challenges:
    1.
    Multiview methods are often locked to specific AI segmentation models or architectures, such as Multi-View Convolutional Neural Networks (MVCNNs) [1, 3, 5, 6, 9, 10, 23, 24], which is a major limitation given recent advances in other AI architectures, such as transformer-based models, particularly Large Language Models (LLMs);
     
    2.
    Incomplete and/or inaccurate segment detection due to occlusion. Points behind other points or missing in a specific view can be incorrectly classified, introducing non-negligible errors in the segmentation process [2, 4, 26];
     
    3.
    Limitations in feature fusion with eventual information loss, usually occurring at the interface between single-view and multiview networks in CNNs, which requires view pooling, invariant-property seeking, etc., and may omit particularities or view-specific information [4, 22];
     
    4.
    Ineffective for incremental segmentation, as they require re-processing the entire point cloud each time new point data is added. This is a limitation for real-time awareness in dynamic scenarios [14]. Although recent research in 3D AI-based models specifically addresses progressive updating of point cloud segmentation [12, 20, 25], to the best of our knowledge, this has not yet been addressed for multiview methods.
     
Furthermore, a common limitation of AI methods, both 3D and 2D, is that they do not fully exploit the geometric information present in the data [4, 8]. Including depth data in RGB segmentation partially addresses the issue [26]. However, spatial relationships are not fully analyzed by CNNs [26], which typically look for patterns in the pixel data and treat depth as an additional channel. Recent research complements semantic segmentation with geometric analysis [8]. However, for 2D AI-based methods, the gap between semantic and geometric analysis remains [4].
In this work, we propose the 3-2-3 Multi-AI Segmentation Framework, which fully leverages geometric information through geometric feature fusion, enabling it to address the limitations described above, as explained in the next section.

3 3-2-3 Multi-AI Segmentation Framework

In this section, first, we present our framework; second, we emphasize its main features and how these address the limitations identified in Sect. 2. Our framework follows the steps illustrated in Fig. 1, as described below.
1.
Point cloud pre-processing:
  • Registers and downsamples the point cloud substantially (down to 0.005% of the original), according to the area covered and required accuracy (e.g., a minimum distance between points of 500 mm or 10 mm for urban scale or room close-ups, respectively) (Fig. 1-1.1, 1.2).
  • Renders the point cloud, adjusts visualization, and extracts 2D views from the original cloud (Fig. 1-1.3, 1.4).
 
2.
View-level segmentation (single viewpoint):
  • Segments 2D views of the original cloud, using any 2D AI that takes 2D images and outputs pixel maps identifying the segments (Fig. 1-2.1, 2.2);
  • Projects the downsampled cloud to 2D (keeping point order) and defines a set of point indices per segment, calculating intersections between projected points and segment polygons (Fig. 1-2.3, 2.4);
  • References corresponding 3D points by index and calculates segment’s geometric descriptors based on the downsampled cloud (e.g., volume, bounding box, c-hull), for later accurate 3D analysis with minimal 3D geometry (Fig. 1-2.5, 2.6);
  • Validates and saves segments as minimal data, discarding those that fail validation (Fig. 1-2.7, 2.8).
 
3.
Scan-level segmentation (integrating multiple viewpoints):
  • Performs geometric feature fusion, finding segments sharing the same 3D space and fusing the matching ones, by analyzing relationships between geometric descriptors. Unmatched segments remain separate (Fig. 1-3.1).
  • Integrates the sets of point indices of fused segments and updates geometric descriptors (Fig. 1-3.2).
  • Validates and saves fused segments, discarding those that fail validation (Fig. 1-3.3, 3.4).
 
4.
Space-level segmentation (integrating multiple scans):
  • Performs geometric feature fusion, matching and fusing scan-level segments from different scans, using geometric descriptors (Fig. 1-4.1).
  • Integrates sets of point indices of fused segments and updates geometric descriptors (Fig. 1-4.2).
  • Validates fused segments and discards those that fail validation (Fig. 1-4.2).
  • Adds detail per segment if requested, by accessing the original cloud (Fig. 1-4.3).
  • Achieves segmentation of the entire space and refines it progressively as segments are updated with new scans and views. Outputs can be requested per segment, as point data, convex hulls, or reconstructed surfaces, with different LoD and in different formats such as GML, JSON, IFC (BIM), etc. (Fig. 1-4.4).
 
Our proposed framework places geometric information at the core of the feature fusion method. If two segments share the same classification and spatial position at the same time, they are considered to be the same. We fuse features based on their spatio-temporal position in two stages: between views in the same scan (Fig. 1-3.1), and between scans (Fig. 1-4.1).
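For concreteness, the following is a minimal sketch of the kind of "minimal data" record kept per segment: a set of point indices referencing the downsampled cloud, simple geometric descriptors, and the metadata returned by the 2D AI. The field names are our illustrative assumptions, not the authors' actual schema (Python 3.9+).

```python
from dataclasses import dataclass, field

@dataclass
class SegmentRecord:
    """Illustrative per-segment minimal data (hypothetical field names)."""
    class_name: str                      # label returned by the 2D AI (e.g., "panel")
    predicted_accuracy: float            # confidence reported by the 2D AI
    point_indices: set[int] = field(default_factory=set)      # indices into the downsampled cloud
    bbox_min: tuple[float, float, float] = (0.0, 0.0, 0.0)    # axis-aligned bounding box, min corner
    bbox_max: tuple[float, float, float] = (0.0, 0.0, 0.0)    # axis-aligned bounding box, max corner
    hull_vertices: list[tuple[float, float, float]] = field(default_factory=list)  # convex-hull vertices
    source_views: list[str] = field(default_factory=list)     # scan/view identifiers, for traceability
```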
To minimize computational load, geometry is simplified as much as possible for each operation (e.g., 2D intersection, downsampled cloud, segment c-hull) to ensure speed without compromising accuracy. The point indices of each segment reference both the downsampled and the original point clouds. The downsampled cloud is accessed when a 2D projection is needed for a new view (Fig. 1-2.3), while the original point cloud is accessed only if a higher LoD is required for a specific segment, thus optimizing resource usage (Fig. 1-4.3). The analysis of geometric properties is supported by Geometric Data Inference AI, which allows for fast and traceable control of the segmentation process [21]. This geometric feature fusion method addresses the limitations previously identified in multiview methods. In particular:
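As an illustration of the fusion step, the sketch below matches two segment records by class label and by the overlap of their axis-aligned bounding boxes. This is a simplified stand-in for the descriptor-based matching described above (the paper does not specify the exact matching criteria or thresholds) and builds on the hypothetical SegmentRecord fields sketched earlier.

```python
import numpy as np

def aabb_iou(min_a, max_a, min_b, max_b):
    """Intersection-over-union of two axis-aligned 3D bounding boxes."""
    inter_dims = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter_vol = np.prod(inter_dims)
    union = np.prod(max_a - min_a) + np.prod(max_b - min_b) - inter_vol
    return float(inter_vol / union) if union > 0 else 0.0

def try_fuse(seg_a, seg_b, iou_threshold=0.3):
    """Fuse seg_b into seg_a if both share a class and occupy the same 3D space (threshold assumed)."""
    if seg_a.class_name != seg_b.class_name:
        return None
    iou = aabb_iou(np.array(seg_a.bbox_min), np.array(seg_a.bbox_max),
                   np.array(seg_b.bbox_min), np.array(seg_b.bbox_max))
    if iou < iou_threshold:
        return None
    seg_a.point_indices |= seg_b.point_indices                 # union of point-index sets
    seg_a.bbox_min = tuple(np.minimum(seg_a.bbox_min, seg_b.bbox_min))
    seg_a.bbox_max = tuple(np.maximum(seg_a.bbox_max, seg_b.bbox_max))
    seg_a.source_views += seg_b.source_views                   # keep traceability across views/scans
    return seg_a
```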
1.
Whereas multiview segmentation methods are often locked to specific AI segmentation models, our geometric feature fusion is AI-model independent. This allows our framework to use not only CNNs, but also other segmentation technologies such as transformer-based models, namely multimodal LLMs. As such, the AI used in Fig. 1-2.1 can be any AI capable of identifying segments in 2D images and generating segmentation masks. Similarly, steps (Fig. 1-1.2) to (Fig. 1-3.3) can be abbreviated to work with 3D AI instead;
 
2.
Regarding occlusion, our method works with single-scan point clouds, and camera views are taken from the scan source position, eliminating the issue of occluded points and mitigating the problem of point sparsity, which particularly affects other camera positions;
 
3.
Regarding loss of information on feature fusion, our 3D geometry feature fusion method minimizes this problem by directly connecting 2D segments with 3D geometry, integrating both common information and view-specific data;
 
4.
Regarding incremental segmentation, our method can identify segments based only on the new data, and fuse them with previous segments in a scene without duplicating the segmentation task.
 
In the next section, we illustrate our implementation and results on processing scan data from a public building.

4 Use Case

In our use case, we segment scan data from a public building using three point clouds: (I) a 360\(^\circ \) LiDAR drone capture of the building and the surrounding area, (II) a 360\(^\circ \) LiDAR capture of the interior of a room, and (III) a point cloud captured by a user-held RGB-D camera of a section of the room. We request full-resolution point data for a “panel with good lighting and a view to a nearby building” to showcase its usefulness for the eventual implementation of an XR activity (Fig. 2).

4.1 Point Cloud Pre-processing (Single Viewpoint)

Three e57 files containing previously calibrated point cloud data were used for our test: Scan I, a 360\(^\circ \) LiDAR drone capture of the building and the surrounding area with over 20 million points; Scan II, a 360\(^\circ \) LiDAR capture of the room with over 11 million points; and Scan III, a user-held RGB-D camera capture with over 2 million points. They were processed into coordinate and color data using Python's e57 module (Fig. 2-A).
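As a minimal sketch of this loading step, the snippet below reads one scan from an .e57 file into coordinate and color arrays using the pye57 package. The paper only states that "Python's e57 module" was used, so the specific package, field names, and file name here are assumptions.

```python
import numpy as np
import pye57  # one common Python reader for .e57 files; the exact package used by the authors is not specified

def load_e57_scan(path: str, scan_index: int = 0):
    """Read one scan into (N, 3) coordinates and (N, 3) colors (assumes the file stores color fields)."""
    e57 = pye57.E57(path)
    data = e57.read_scan(scan_index, colors=True)
    points = np.column_stack([data["cartesianX"], data["cartesianY"], data["cartesianZ"]])
    colors = np.column_stack([data["colorRed"], data["colorGreen"], data["colorBlue"]])
    return points, colors

# Example (hypothetical file name):
# pts_I, col_I = load_e57_scan("scan_I_drone.e57")
```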
Each point cloud was downsampled to a minimum distance between points of 500 mm, 100 mm, and 50 mm for Scans I, II, and III, respectively, while point-index correspondence with each original cloud was kept, using C++ and PCL. Point clouds were rendered and the 3D visualization was optimized for improved AI recognition, using C++ and VTK (Fig. 2-B.1).
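The downsampling with preserved index correspondence was done in C++ with PCL; the sketch below is an illustrative NumPy equivalent, under the assumption that a voxel grid approximates the minimum-distance criterion, keeping one representative original index per occupied voxel.

```python
import numpy as np

def downsample_keep_indices(points: np.ndarray, min_distance: float):
    """Voxel-grid downsampling that also returns the kept points' indices in the original cloud,
    so segments found on the downsampled cloud can later be traced back to full-resolution data."""
    voxel_ids = np.floor(points / min_distance).astype(np.int64)
    # np.unique over rows yields one representative original index per occupied voxel
    _, keep = np.unique(voxel_ids, axis=0, return_index=True)
    keep = np.sort(keep)
    return points[keep], keep

# Example with the Scan I spacing (500 mm, expressed in meters):
# down_I, idx_I = downsample_keep_indices(pts_I, min_distance=0.5)
```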
All views were extracted from the sensor position. In Scan I, only one user-defined view was extracted. In Scan II, the first view was user-defined, and subsequent ones were automatically determined to cover the 360\(^\circ \) cloud span. In Scan III, the view corresponding to the RGB-D camera view was extracted.
For each 2D view, image data and camera information, such as position, aperture, focus, etc., were saved for geometric validation and integration at scan and space levels, using C++ and VTK.

4.2 Per View Data Processing

Several AI models capable of segmenting 2D images were tested, including OneFormer, trained on the ADE20K dataset [7, 27, 28], Meta’s Segment Anything, and LLMs, specifically Google Gemini and ChatGPT-4o. The LLMs benefited from contextual multi-view prompts, allowing segment recognition in relatively obstructed views (views 3 and 4). However, they delivered inconsistent results and highly variable response times throughout the day, sometimes not answering and interrupting the process. Meta’s Segment Anything had the best geometry recognition on our test data. However, being class-agnostic and with processing times 2.5x longer than the average and up to 6x longer than OneFormer, it is sub-optimal for direct application in our framework. As such, OneFormer was chosen for our use case due to its speed and consistent accuracy on our scan data (Fig. 3).
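As a minimal sketch of how a 2D view can be segmented with OneFormer, the snippet below follows the Hugging Face transformers usage for ADE20K-trained checkpoints; the specific checkpoint name is our assumption, since the paper does not state which published weights were used.

```python
import torch
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

# Hypothetical checkpoint choice (ADE20K-trained OneFormer weights published on the Hugging Face Hub)
CHECKPOINT = "shi-labs/oneformer_ade20k_swin_tiny"
processor = OneFormerProcessor.from_pretrained(CHECKPOINT)
model = OneFormerForUniversalSegmentation.from_pretrained(CHECKPOINT)

def segment_view(image_path: str):
    """Return a per-pixel ADE20K class-id map and the id->label mapping for one rendered 2D view."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Per-pixel class ids, resized back to the original view resolution
    semantic_map = processor.post_process_semantic_segmentation(
        outputs, target_sizes=[image.size[::-1]])[0]
    return semantic_map.cpu().numpy(), model.config.id2label
```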
Segmentation masks were extracted from each 2D view as pixel maps, using PyTorch and OneFormer, and processed for contour optimization, smoothing, and offsetting, using C++ and OpenCV (Fig. 2-B.2). The downsampled cloud was projected to the 2D view keeping point order, using C++ and VTK. Intersections of 2D points with the segment polygons were tested, and segments were directly referenced in the 3D downsampled cloud, using the C++ standard library (Fig. 2-C).
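The projection and intersection steps were implemented in C++ with VTK and the standard library; the sketch below is a simplified NumPy stand-in that projects the downsampled points with a pinhole model and labels each projected point by looking it up in the per-pixel segmentation map, which is equivalent in effect to testing points against the segment polygons. The camera matrices are assumed to be known.

```python
import numpy as np

def project_points(points_world: np.ndarray, world_to_cam: np.ndarray, intrinsic: np.ndarray):
    """Pinhole projection of 3D points to pixel coordinates.
    world_to_cam: 4x4 extrinsic matrix; intrinsic: 3x3 camera matrix (illustrative stand-in for VTK)."""
    homog = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (world_to_cam @ homog.T).T[:, :3]        # world -> camera coordinates
    in_front = cam[:, 2] > 0                       # keep only points in front of the camera
    uvw = (intrinsic @ cam.T).T
    z = np.where(np.abs(uvw[:, 2:3]) < 1e-9, 1e-9, uvw[:, 2:3])  # guard against division by zero
    uv = uvw[:, :2] / z                            # perspective divide
    return uv, in_front

def label_points_from_mask(uv, in_front, semantic_map):
    """Assign each projected point the class id of the pixel it falls on (-1 if outside the view)."""
    h, w = semantic_map.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.full(len(uv), -1, dtype=int)
    labels[valid] = semantic_map[v[valid], u[valid]]
    return labels
```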
Geometric descriptors, such as convex hull, centroid, etc., were calculated per segment, using C++ and CGAL. 3D segments were validated by testing the detected geometry against an object template (Fig. 4). The validated segments were stored as minimal data, namely sets of point indices, geometric descriptors, and segment metadata added by OneFormer, such as class name and predicted accuracy, for later feature fusion across multiple views and scans.
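For illustration, per-segment descriptors of this kind can be computed from the downsampled cloud as follows. The paper used C++ and CGAL, so SciPy here is a substitute, and the exact descriptor set is an assumption.

```python
import numpy as np
from scipy.spatial import ConvexHull

def segment_descriptors(down_points: np.ndarray, point_indices: np.ndarray):
    """Compute simple geometric descriptors for one segment of the downsampled cloud
    (assumes at least four non-coplanar points, as required by a 3D convex hull)."""
    pts = down_points[point_indices]
    hull = ConvexHull(pts)                       # 3D convex hull of the segment's points
    return {
        "centroid": pts.mean(axis=0),
        "bbox_min": pts.min(axis=0),
        "bbox_max": pts.max(axis=0),
        "hull_vertices": pts[hull.vertices],     # vertices of the convex hull
        "hull_volume": hull.volume,              # volume enclosed by the hull
    }
```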

4.3 Per Scan Data Processing (Multiple Views)

On Scan II, our framework performed segmentation on multiple views. As such, the segments identified in each view were consolidated through 3D geometry-based feature fusion.
Matching segments were fused, and the geometric descriptors were updated and saved as minimal data. Fused segments were validated with relational tests, both physical and contextual (Fig. 4). Non-validated segments were discarded.
In this process, geometry was never duplicated; it was kept only in the original and downsampled clouds. A 2D projection was calculated only once per view, and geometric descriptors were calculated per segment, referencing the downsampled cloud. Segments were stored per view for incremental consolidation and detailing at the space level.
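A minimal sketch of this scan-level consolidation, reusing the hypothetical try_fuse helper from the Sect. 3 sketch: incoming view-level segments are greedily fused with the first matching scan-level segment, otherwise appended as new segments. The paper does not describe the exact consolidation order, so this greedy scheme is an assumption.

```python
def consolidate(view_segments, iou_threshold=0.3):
    """Fuse view-level segments into scan-level segments; unmatched segments remain separate."""
    scan_segments = []
    for seg in view_segments:
        for existing in scan_segments:
            if try_fuse(existing, seg, iou_threshold) is not None:
                break                      # fused into an existing scan-level segment
        else:
            scan_segments.append(seg)      # no match found: keep as a new segment
    return scan_segments
```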

4.4 Space-Level Segmentation (Multiple Scans)

The relative positions of Scans I, II and III were identified and matching segments fused. The consolidated segments were validated with relational tests at the space level, and the geometric descriptors were updated (Fig. 2-D).
Case-specific tests were introduced to find a “panel with good lighting and a view to a nearby building”. Namely, an illumination score was measured based on the distances to the window and ceiling lights. Additionally, the view of a nearby building was assessed by simplified ray-casting, ensuring that at least one ray originating at the panel (Scans II and III) hits the nearby building (Scan I) while otherwise hitting only transparent elements, namely the window. Detail was added for the selected panel, where full resolution was requested. In this case, the c-hull of the consolidated segment was used to check for intersection with the original point cloud, allowing retrieval of the full-resolution point data (Fig. 2-E).
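The illumination score and the simplified ray-cast are described only qualitatively in the paper; the sketch below shows one possible naive formulation (a distance-based score and coarse ray marching against opaque and transparent bounding boxes). All function names, thresholds, and the scoring formula are our assumptions.

```python
import numpy as np

def illumination_score(panel_centroid, light_centroids):
    """Naive illumination score: inverse of the distance to the nearest window or ceiling light."""
    panel_centroid = np.asarray(panel_centroid, dtype=float)
    dists = [np.linalg.norm(panel_centroid - np.asarray(c, dtype=float)) for c in light_centroids]
    return 1.0 / (1.0 + min(dists))

def has_view_of(panel_centroid, building_centroid, opaque_boxes, transparent_boxes, step=0.25):
    """Simplified ray-cast: march from the panel towards the building and require that no sample
    point lies inside an opaque box unless it also lies inside a transparent one (e.g., the window).
    Boxes are (bbox_min, bbox_max) pairs taken from segment descriptors."""
    panel_centroid = np.asarray(panel_centroid, dtype=float)
    building_centroid = np.asarray(building_centroid, dtype=float)
    direction = building_centroid - panel_centroid
    length = np.linalg.norm(direction)
    direction = direction / length
    for t in np.arange(step, length, step):
        p = panel_centroid + t * direction
        in_opaque = any(np.all(p >= lo) and np.all(p <= hi) for lo, hi in opaque_boxes)
        in_transparent = any(np.all(p >= lo) and np.all(p <= hi) for lo, hi in transparent_boxes)
        if in_opaque and not in_transparent:
            return False                     # the ray is blocked by an opaque element
    return True
```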
Our framework can update the segmentation data indefinitely. Each consolidated segment identifies all the integrated segments and the respective scan and view. Currently, the output is saved as .obj files, corresponding to the convex hulls or reconstructed geometry of the consolidated segments, and .txt files containing the origin and metadata of each segment.
Preliminary comparison to a baseline 3D geometric method, using PCL’s RANSAC and Region Growing algorithms, reveals that while the geometric method struggles due to occlusion in the single scan point clouds, our framework maintains good classification accuracy across the different scans.
Regarding processing times, even before optimization, Scan I demonstrates the advantage of our framework, especially on large point clouds, due to the substantial downsampling (Sect. 4.1). Scan II represents the least favourable scenario for our framework, the initial segmentation of the entire room, which requires segmenting four views, increasing processing times. However, this is balanced by later scans, such as Scan III, where our framework processes only the newly added data, unlike the geometric method, which requires the entire point cloud every time (Fig. 5).
Our framework enables quick and accurate identification of the 3D geometry of the main elements both in large exterior scenes and at room-level environments, facilitating the retrieval of detailed geometry of requested elements, which can be referenced by XR applications.

5 Discussion

Regarding the challenges initially identified, our preliminary results indicate that the framework effectively addresses common limitations in existing segmentation methods. (i) The ability to test different segmentation models demonstrates its flexibility to take advantage of future advances in segmentation technology. Geometry-based feature fusion allows (ii) LoD segmentation, spending computational load only on requested elements and thus enhancing computational efficiency. It also allows (iii) incremental segmentation, which avoids the reprocessing of point data and is especially advantageous for a progressive, multi-scan environment. Our preliminary comparative study indicates processing times on par with a baseline 3D geometric method, and markedly superior segment recognition (Fig. 5). Further computational optimization, such as multi-threading, should allow real-time-compatible performance. This computational efficiency suggests (iv) suitability for on-device processing in a distributed computing environment.
Although our results are promising, key issues were identified for future development.
  • View selection and segment accuracy. In relatively obstructed views (e.g., with objects too close to the sensor), segment classification accuracy tended to decrease. Capturing views with wider aperture angles mitigated this issue with OneFormer, while LLMs benefited from contextual multiview prompts. However, improving the view selection algorithm, based on a preliminary analysis of similar views and opting for the views with the highest predicted accuracy, remains relevant to minimize the number of necessary views and erroneous results, especially for one-image-at-a-time segmentation models.
  • 2D AI articulation. Future developments in 2D AI models are expected to improve segmentation accuracy beyond what we assessed in our tests (Fig. 3). Enabling switching between AI models depending on view characteristics (exterior or interior scenes, full objects or parts, etc.) can also improve segmentation results. A detailed benchmark of different AI models for different scenes should provide further insight into the relative advantages and drawbacks of each model and guide their articulation within the Multi-AI segmentation framework. While relying on 2D AI-based segmentation is an obvious advantage for running locally on systems with limited resources, benchmarking different segmentation strategies on more powerful setups is important to gauge the relative advantage of utilizing 3D AI over 2D AI models, particularly in the back-end. In this context, the 3-2-3 Multi-AI Framework can handle 2D-based segmentation as preprocessing, labeling point data before feeding it into 3D AI, thereby enhancing the efficiency of 3D-based segmentation.
  • 2D-3D Reprojection and point cloud quality. In the process of 2D-3D reprojection, the quality of the raw clouds can be an obstacle. The hand-held RGB-D camera used, an Azure Kinect, produced distortions that can lead to erroneous geometry. Our framework intersects 2D segments with the downsampled version of the cloud and uses symbolic AI-based post-validation, minimizing this issue. However, many elements failed validation and were discarded, representing missed opportunities for data leveraging. Although evolving scanning technology should mitigate the quality issues of raw clouds in the future, incorporating lightweight, scanner-specific preprocessing steps in our framework should be considered to reduce errors and improve 3D geometry accuracy.
  • Geometric Data Inference AI
    • 3D Geometry Feature Fusion relies on 3D positioning, which presents challenges that are common when tracking moving objects but also arise with static elements. Our current feature fusion requires classification of the same points across different views, which can be challenging for architectural elements or large objects. For example, a large table of which only the two ends are captured, in different views, will be identified as two separate objects. Our framework saves segment data per view and allows integration with iterative segmentation. However, elaborating feature fusion algorithms to handle spatial discontinuity is essential to make the best use of available data, for instance to reconstruct missing geometry or to guide necessary additional scans.
    • Geometry Validation. When only partial geometry is present, testing against an object template often fails validation and ultimately leads to the loss of valuable information. To handle this, our algorithm adapts the tests depending on whether the geometry is full or partial (Fig. 4), not restricting itself to geometric descriptors of point data but also considering elements such as the number of views, view cones, relative positions, etc. We adapted validation algorithms for common furniture elements, achieving good results on our test data. However, efficiently distinguishing between erroneous and valid but incomplete data is an essential point for further development.
    • Spatial Awareness. Our simplified ray-casting and naive illumination score allowed us to find a panel with good lighting and a view to a nearby building, and to add high-resolution data from a mobile scan. This is an example of using low-level 2D segments, through geometric analysis, to create high-level spatial awareness, which shall be further developed. Currently, our Geometric Data Inference AI employs only symbolic AI. However, a hybrid symbolic-neural strategy could enhance explicit segmentation control with improved resilience and adaptability. By integrating our low-level segmentation data with transformer-based architectures, such as LLMs, we can develop efficient high-level spatial intelligence.

6 Conclusion

Our proposed framework demonstrates promising results in overcoming the limitations of previous studies and addressing challenges in point cloud segmentation for XR systems that integrate architectural and urban scales.
The framework (i) integrates various 2D AI segmentation models through geometric feature fusion. (ii) It supports LoD segmentation, enabling rapid segmentation of simplified geometry while providing full-detail segments when necessary. By (iii) incrementally updating the segmentation database, processing only newly added data, our framework avoids reprocessing the entire point cloud, thus improving computational efficiency and (iv) allowing the use of local hardware in a distributed computing environment.
Some issues identified in the Discussion (Sect. 5), such as view selection and segment accuracy, are likely to be mitigated by advances in 2D AI models. LLMs, namely ChatGPT-4o, can contextually understand even occluded views when provided alongside more illustrative ones. With better control over the output format, these models might substantially mitigate the issue of view selection and segment accuracy.
Our Geometric Data Inference AI is the cornerstone of our Multi-AI integration. Focusing on geometric integration allows our framework to be 2D AI-agnostic and flexible enough to leverage segmentation models of different origins to enhance spatial awareness, adapting based on scene characteristics, available data, and local hardware. Further developing these algorithms is essential to ensure a high-quality, resilient segmentation framework, which, by binding geometric relationships to their meaning, can become the foundation for a multipurpose spatial AI capable of unlocking the potential of segmentation data to extend and enhance the human spatial experience.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
References
1. Chen, X., Sun, Y., Song, S., Jia, J.: Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation. In: European Conference on Computer Vision (ECCV), pp. 561–577 (2020)
2. Dhakal, S., Carrillo, D., Qu, D., Nutt, M., Yang, Q., Fu, S.: VirtualPainting: addressing sparsity with virtual points and distance-aware data augmentation for 3D object detection (2023)
3. Fooladgar, F., Kasaei, S.: Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images. arXiv preprint arXiv:1912.11691 (2019)
4. Guo, Y., Wang, H., Hu, Q., Liu, H., Liu, L., Bennamoun, M.: Deep learning for 3D point clouds: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 43(12), 4338–4364 (2020)
5. Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture. In: Asian Conference on Computer Vision (ACCV), pp. 213–228 (2016)
6. Hu, X., Yang, K., Fei, L., Wang, K.: ACNet: attention based network to exploit complementary features for RGBD semantic segmentation. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 1440–1444. IEEE (2019)
7. Jain, J., Li, J., Chiu, M., Hassani, A., Orlov, N., Shi, H.: OneFormer: one transformer to rule universal image segmentation (2022)
9. Jiang, J., Zheng, L., Luo, F., Zhang, Z.: RedNet: residual encoder-decoder network for indoor RGB-D semantic segmentation. arXiv preprint arXiv:1806.01054 (2018)
10. Lee, S., Kim, S., Lee, T.H., Lee, S., Kim, I.S.K.: RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In: International Conference on Computer Vision (ICCV), pp. 4990–4999 (2017)
11. Lyu, X., Chang, C., Dai, P., Sun, Y.T., Qi, X.: Total-Decom: decomposed 3D scene reconstruction with minimal interaction (2024)
12. McCool, R., et al.: FRAME: fast and robust autonomous 3D point cloud map-merging for egocentric multi-robot exploration. arXiv preprint arXiv:2301.09213 (2023)
13. Miyake, Y., Toyoda, K., Kasuya, T., Hyodo, A., Seiki, M.: Proposal for the implementation of spatial common ground and spatial AI using the SSCP (spatial simulation-based cyber-physical) model. In: IEEE International Smart Cities Conference, ISC2 2023, Bucharest, Romania, 24–27 September 2023, pp. 1–7. IEEE (2023). https://doi.org/10.1109/ISC257844.2023.10293487
14. Pan, L., et al.: Multi-view partial (MVP) point cloud challenge 2021 on completion and registration: methods and results (2021)
15. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. arXiv preprint arXiv:1612.00593 (2017)
16. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 30, 5099–5108 (2017)
17. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data (2018)
18. Rusu, R.B., Cousins, S.: 3D is here: Point Cloud Library (PCL). In: 2011 IEEE International Conference on Robotics and Automation, pp. 1–4 (2011)
20. Shi, W., Rajkumar, R.R.G.: Point-GNN: graph neural network for 3D object detection in a point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1711–1719 (2020)
22. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3D shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953 (2015)
23. Valada, A., Mohan, R., Burgard, W.: Self-supervised model adaptation for multimodal semantic segmentation. Int. J. Comput. Vis. (IJCV) (2019)
24. Xing, Y., Wang, J., Chen, X., Zeng, G.: Coupling two-stream RGB-D semantic segmentation network by idempotent mappings. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 1850–1854. IEEE (2019)
25. Zhang, H., et al.: PointMBF: a multi-scale bidirectional fusion network for unsupervised RGB-D point cloud registration. arXiv preprint arXiv:2308.04782 (2023)
26. Zhong, Y., Dai, Y., Li, H.: 3D geometry-aware semantic labeling of outdoor street scenes. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2343–2349. IEEE (2018)
27. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641 (2017)
28. Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vision 127, 302–321 (2019)
Metadata
Title
3-2-3 Multi-AI Segmentation Framework: LoD-Based, Incremental Segmentation of 3D Scan Data Using Any 2D AI
Authors
Hermenegildo Solheiro
Lee Kent
Keisuke Toyoda
Copyright Year
2025
DOI
https://doi.org/10.1007/978-3-031-78593-1_8