A System for an Accurate 3D Reconstruction in Video Endoscopy Capsule
© Anthony Kolar et al. 2009
Received: 15 March 2009
Accepted: 12 October 2009
Published: 30 November 2009
Skip to main content
© Anthony Kolar et al. 2009
Received: 15 March 2009
Accepted: 12 October 2009
Published: 30 November 2009
Since few years, the gastroenterologic examinations could have been realised by wireless video capsules. Although the images make it possible to analyse some diseases, the diagnosis could be improved by the use of the 3D Imaging techniques implemented in the video capsule. The work presented here is related to Cyclope, an embedded active vision system that is able to give in real time both 3D information and texture. The challenge is to realise this integrated sensor with constraints on size, consumption, and computational resources with inherent limitation of video capsule. In this paper, we present the hardware and software development of a wireless multispectral vision sensor which allows to transmit, a 3D reconstruction of a scene in realtime. multispectral acquisitions grab both texture and IR pattern images at least at 25 frames/s separately. The different Intellectual Properties designed allow to compute specifics algorithms in real time while keeping accuracy computation. We present experimental results with the realization of a large-scale demonstrator using an SOPC prototyping board.
Examination of the whole gastrointestinal tract represents a challenge for endoscopists due to its length and inaccessibility using natural orifices. Moreover, radiologic techniques are relatively insensitive for diminutive, flat, infiltrative, or inflammatory lesions of the small bowel. Since 1994, video capsules (VCEs) [1, 2] have been developed to allow direct examination of this inaccessible part of the gastrointestinal tract and to help doctors to find the cause of symptoms such as stomach pain, disease of Crohn, diarrhoea, weight loss, rectal bleeding, and anaemia.
The Pillcam video capsule designed by Given Imaging Company is the most popular of them. This autonomous embedded system allows acquiring about 50 000 images of gastrointestinal tract during more than twelve hours of an analysis. The off-line image processing and its interpretation by the practitioner permit to determine the origin of the disease. However, recent benchmark  published shows some limitations on this video capsule as the quality of images and the inaccuracy on the size of the polyps. Accuracy is a real need because the practitioner makes an ablation of a polyp only if it exceeds a minimum size. Actually the polyp size is estimated by practitioner's experience with more or less error for one practitioner to another. One of the solutions could be to use techniques of 3D imagery, either directly in the video capsule or on a remote computer.
This later solution is actually used in the Pillcam capsule by using the 2–4 images that are taken per second and stored wirelessly in a recorder that is worn around the waist. 3D processing is performed off-line from the estimation of the displacement of the capsule. However, the speed of video-capsule is not constant; for example, in the oesophagus, it is of 1.44 m/s, and in the stomach it is almost null and is 0.6 m/s in the intestine. Consequently, by taking images at frequencies constant, certain areas of the transit will not be rebuilt. Moreover, the regular transmission of the images by the body consumes too much energy and limits the autonomy of the video capsules to 10 hours. Ideally, the quantity of information to be transmitted must be reduced at the only pertinent information like polyps or other 3D objects. The first development necessary to the delivery of such objects relies on the use of algorithm of pattern recognition on 3D information inside the video capsule.
The introduction of 3D reconstruction techniques inside a video capsule needs to define a new system that takes into account the hard constraints of size, low power consumption, and processing time. The most common 3D reconstruction techniques are those based on passive or active stereoscopic vision methods, where image sensors are used to provide the necessary information to retrieve the depth. Passive method consists of taking at least two images of a scene at two different points of view. Unfortunately using this method, only particular points, with high gradient or high texture, can be detected . The active stereo-vision methods offer an alternative approach when processing time is critical. They consist in replacing one of the two cameras by a projection system which delivers a pattern composed by a set of structured rays. In this latter case, only an image of the deformation of the pattern by the scene is necessary to reconstruct a 3D image. Many implementations based on active stereo-vision have been realised in the past [5, 6] and provided significant results on desktop computers. Generally, these implementations have been developed to reconstruct 3D large objects as building [7–14].
In our research work, we have focused on an integrated 3D active vision sensor: "Cyclope." The concept of this sensor was first described in . In this new article we focus on the presentation of our first prototype which includes the instrumentation and processing blocks. This sensor allows making in real time a 3D reconstruction taking into account the size and power consumption constraints of embedded systems . It can be used in wireless video capsules or wireless sensor networks. In the case of video capsule in order to be comfortable for the patient, the results could be stored in a recorder around the waist. It is based on a multispectral acquisition that must facilitate the delivery of a 3D textured reconstruction in real time (25 images by second).
This paper is organised as follows, Section 2 describes briefly Cyclope and deals with the principles of the active stereo-vision system and 3D reconstruction method. In Section 3 we present our original multispectral acquisition. In Section 4 we present the implementation of the optical correction developed to correct the lens distortion. Section 5 deals with the implementation of a new thresholding and labelling methods. In Sections 6 and 7, we present the processing of matching in order to give a 3D representation of the scene. Section 8 deals with wireless communication consideration. Finally, before a conclusion and perspectives of this work, we present, in Section 9, a first functional prototype and its performances which attest the feasibility of this original approach.
(i)Instrumentation block: it is composed of a CMOS camera and a structured light projector on IR band.
(ii)Processing block: it integrates a microprocessor core and a reconfigurable array. The microprocessor is used for sequential processing. The reconfigurable array is used to implement parallels algorithms.
(iii)RF block: it is dedicated for the OTA (Over the Air) communications.
The feasibility of Cyclope was studied by an implementation on an SOPC (System On Programmable Chip) target. These three parts will be realised in different technologies: CMOS for the image sensor and the processing units, GaAs for the pattern projector, and RF–CMOS for the communication unit. The development of such integrated "SIP" (System In Package) is actually the best solution to overcome the technological constraints and realise a chip scale package. This solution is used in several embedded sensors such as The "Human++" platform  or Smart Dust .
The basic principle of 3D reconstruction is the triangulation. Knowing the distance between two cameras (or the various positions of the same camera) and defining of the line of views, one passing by the center of camera and the other by the object, we can find the object distance.
The active 3D reconstruction is a method aiming to increase the accuracy of the 3D reconstruction by the projection on the scene of a structured pattern. The matching is largely simplified because the points of interest in the image needed to the reconstruction are obtained by the extraction of the pattern; it also has the effect to increase the speed of processing.
(i)the line of sight, passing through the pattern point on the scene and its projection in the image plan,
(ii)the laser ray, starting from the projection center and passing through the chosen pattern point.
If we consider the active stereoscopic system as shown in Figure 3, where is the projection of in the image plan and the projection of on the camera plan , the projection of the light ray supporting the dot on the image plan is a straight line. This line is an epipolar line [18–20].
To rapidly identify a pattern point on an image we can limit the search to the epipolar lines.
For Cyclope the pattern is a regular mesh of points. For each point of the pattern we can find the corresponding epipolar line:
where are the image coordinates and the parameters are estimated through an off-line calibration process.
In addition to the epipolar lines, we can establish the relation between the position of a laser spot in the image and its distance to the stereoscopic system.
By considering the two triangles and , we can express as
where is the stereoscopic, the focal length of the camera and the distance in pixels:
Given the epipolar line we can express as a function of only one image coordinates:
From (2) and (4), we can express, for each pattern point , the depth as a hyperbolic function:
where the and parameters are also estimated during the off-line calibration of the system .
We can compute the inverse of the depth to simplify the implementation. Two operations are only needed: an addition and a multiplication. The computation of the depth of each point is independent of the others. So, all the laser spots can be computed separately allowing the parallelisation of the architecture.
(1)the multispectral acquisition which makes the discrimination between the pattern and the texture by an energetic method;
(2)the correction of the error coordinates due to the optical lens distortion;
(3)the processing before the 3D reconstruction as thresholding, segmentation, labelling, and the computation of the laser spot center;
(4)the computation of the matching and the third dimension;
(5)the transmission of the data with a processor core and an RF module.
The combination of the acquisition of the projected pattern on the infrared band, the acquisition of the texture on the visible band, and the mathematical model of the active 3D sensor makes it possible to restore the 3D textured representation of the scene. This acquisition needs to separate texture and 3D datas. For this purpose we have developed a multispectral acquisition . Generally, filters are used to cut the spectral response. We used here an energetic method, which has the advantage of being generic for imagers.
The typical values are
Generally, the lenses used in the VCE introduce large deformations on acquired images because of their weak focal . This distortion is manifested in inadequate spatial relationships between pixels in the image and the corresponding points in the scene. Such change in the shape of captured object may have critical influence in medical applications, where quantitative measurements in Endoscopy depend on the position and orientation of the camera and its model. The used camera model needs to be accurate. For this reason we introduce firstly the pinhole camera model and later the correction of geometric distortion that are added to enhance it. For practical purposes two different methods are studied to implement this correction, and it is up to researchers to choose their own model depending on their required accuracy level and computational cost.
The effective distortion can be modelled by
where represent radial distortion , represent decentering distortion, and represent thin prism distortion. Assuming that only the first- and second-order terms are sufficient to compensate the distortion, and the terms of order higher than three are negligible, we obtain a fifth-order polynomials camera model (expression 8), where are the distorted image coordinates in pixels, and are true coordinates (undistorted):
An approximation of the inverse model is done by (11):
The unknown parameters are solved using direct least mean-squares fitting  in the off-line calibration process.
After the computation of parameters in (11) through an off-line calibration process, we used them to correct the distortion of each frame. With the input frame captured by the camera denoted as the source image and the corrected output as the target image, the task of correcting the source distorted image can be defined as follows: for every pixel location in the target image, compute its corresponding pixel location in the source image. Two implementation techniques of distortion correction have been compared:
Calculate the image coordinates through evaluating the polynomials to determine intensity values for each pixel.
Calculate the image coordinates through evaluating the polynomials correction in advance, storing them in a lookup table which is referenced at run-time. All parameters needed for LUT generation are known beforehand; therefore for our system, the LUT is computed only once and off-line.
However, since the source pixel location can be a real number, using it to compute the actual pixel values of the target image requires some form of pixel interpolation. For this purpose we have used the nearest neighbour interpolation approach that means that the pixel value closest to the predicted coordinates is assigned to the target coordinates. This choice is reasonable because it is a simple and fast method for computation, and visible image artefacts have no subject with our system.
Performance results of these two techniques are presented in terms of (i) execution time and (ii) FPGA logic resource requirements.
Area and Clock characteristics of two approaches.
Look Up Table
The execution time for the direct computation implementation is comparatively very slow. This is due to the fact that the direct computation approach consumes a much greater amount of logic resources than the Look-up Table approach. Moreover the slow clock cycle (10 MHz) could be increased by splitting the complex arithmetic logic into several smaller stages. The significant difference between these two approaches is that the direct computation approach requires more computation time and arithmetic operations, while the LUT approach requires more memory accesses and more RAM Blocks occupation. Regarding latency, both approaches can be executed with respect to real-time constraint of video cadence (25 frames per second). Depending on the applications, the best compromise between time and resources must be chosen by the user. For our application, arithmetic operations are intensively needed for later stages in the preprocessing block, while memory blocks are available; so we chose to use the LUT approach to benefit in time and resources.
After lens distortion correction, the laser spots projected must be extracted from the gray level image for delivering a 3D representation of the scene. Laser spots on the image appear with variable sizes (depending on the absorption of the surface and the projection angle). At this level, a preprocessing block has been developed and hardware implemented to make an adaptive thresholding in order to give a binary image and a labelling to classify each laser spot to compute later their center.
Several methods exist from a static threshold value defined by user up to dynamic algorithm as Otsu method .
(i)building the histogram of grey-level image,
(ii)finding the first maxima of the Gaussian corresponding to the Background; compute its mean and standard deviation ,
(iii)calculating the threshold value with (13):
where is an arbitrary constant. A parallel architecture of processing has been designed to compute the threshold and to give a binary image. Full features of this implementation are given in .
After this first stage of extraction of spot laser from the background, it is necessary to classify each laser spot in order to compute separately their center. Several methods have been developed in the past. We chose to use a classical two passes component connected labeling algorithms with an 8-connectivity. We designed a specific optimized Intellectual Property in VHDL. This intellectual property uses fixed point number.
The threshold and labeling processes applied to the captured image allow us to determine the area of each spot (number of pixels). The coordinates of center of these spots could be calculated as follows:
where and and the abscissa and ordinate of th spot center. and are the coordinates of pixels constructing the spot. is the number of pixels of th spot (area in pixels).
To obtain an accuracy 3D reconstruction, we need to compute the spots centers with higher possible precision without increasing the total computing time to satisfy the real-time constraint. The hardest step in center detection part is the division operations in (14). Several methods exist to solve this problem.
The simplest method is the use of hardware divider but they are computationally expensive and consume a considerable amount of resources. This not acceptable for a real-time embedded systems. Some other techniques are used to compute the center of laser spots avoiding the use of hardware dividers.
Some studies suggest approximation methods to avoid implementation of hardware dividers. Such methods like that implemented in  replace the active pixels by the smallest rectangle containing this region and then replace the usual division by simple shifting (division by 2):
This approach is approximated in (15), where are the active pixel coordinates, and are the approximated coordinates of the spot center.
The area of each spot (number of pixels) is always a positive integer, while its value is limited in a predeterminate interval where and are, respectively, the minimum and maximum areas of laser spot in the image. The spot areas depend on object illumination, distance between object and camera, and the angle of view of the scene. Our method consists in a memorisation of where represent the spot pixel number and can take value in . represent the maximum considered size, in pixels, of a spot.
In this case we need only to compute a multiplication, that is resume here:
The implementation of such a filter is very easy, regarding that the most of DSP functions are provided for earlier FPGAs. For example, Virtex-II architecture  provides an bits Multiplier with a latency of about 4.87 ns at 205 MHz and optimised for high-speed operations. Additionally, the power consumption is lower compared to a slice implementation of an 18-bit by 18-bit multiplier . For luminous spots source, the number of operations needed to compute the centers coordinates is , and is the average area of spots. When implementing our approach to Virtex II Pro FPGA (XC2VP30), it was clear that we gain in execution time and size. Comparison of different implementation approaches is described in the next section.
The set of parameters for the epipolar and depth models are used during run time to make point matching (identify the original position of a pattern point from its image) and calculate the depth using the coordinates of each laser spot center.
Starting from the point abscissa we calculate its estimated ordinate if it belongs to a epipolar line. We compare this estimation with the true ordinate .
These operations are made for all the epipolar line simultaneously. After thresholding the encoder returns the index of the corresponding epipolar line.
The next step is to calculate the coordinate from the coordinate and the appropriate depth model parameters
These computation blocs are synchronous and pipelined, allowing, thus, high processing rates.
In this bloc the estimated ordinate is calculated . The parameters are loaded from memory.
In this bloc the absolute value of the difference between the ordinate and its estimation is calculated. This difference is then thresholded.
The thresholding avoids a resource consuming sort stage. The threshold was a priori chosen as half the minimum distance between two consecutive epipolar lines. The threshold can be adjusted for each comparison bloc.
This bloc returns a "1" result if the distance is underneath the threshold.
If the comparison blocs return a unique "1" result, then the encoder returns the corresponding epipolar line index.
If no comparison bloc returns a "true" result, the point is irrelevant and considered as picture noise.
If more than one comparison blocs returns "1", then we consider that we have a correspondence error and a flag is set.
The selected index is then carried to the next stage where the coordinate is calculated. It allows the selection of the right parameters to the depth model.
We compute , rather than as we said earlier, to have a simpler computation unit. This computation bloc is then identical to the estimation bloc.
Finally, after computation, the 3D coordinates of the laser dots accompanied by the image of texture are sent to an external reader. So, Cyclope is equipped with a block of wireless communication which allows us to transmit the image of texture, the coordinates 3D of the centers of the spots laser, and even to remotely reconfigure the digital processing architecture (an Over The Air FPGA). While attending the IEEE802.15 Body Area Network standard , the frequency assigned for implanted device RF communication is around 403 MHz and referred to as the MICS (Medical Implant Communication System) band due to essentially three reasons:
(i)a small antenna,
(ii)a minimum losses environment which allows to design low-power transmitter,
(iii)a free band without causing interference to other users of the electromagnetic radio spectrum .
In order to make rapidly a wireless communication of our prototype, we chose to use Zigbee module at 2.45 GHz available on the market contrary to modules MCIS. We are self-assured that later frequency is not usable for the communication between the implant and an external reader, due to the electromagnetic losses of the human body. Two Xbee-pro modules from the Digi Corporation have been used. One for the demonstrator and the second plugged on a PC host where a human machine interface has been designed to visualise in real-time the 3D textured reconstruction of the scene.
To demonstrate the feasibility of our system, a large-scale demonstrator has been realised. It uses an FPGA prototyping board based on a Xilinx Virtex2Pro, a pulsed IR LASER projector  coupled with a diffraction network that generates a 49-dot pattern and a CCD imager.
FPGA is used mainly for computation unit but also to control image acquisition, laser synchronisation, analog-to-digital conversion, and image storage and displays the result through a VGA interface.
(i)a global sequencer to control the entire process,
(ii)a reset and integration time configuration unit,
(iii)a VGA synchronisation interface,
(iv)a dual port memory to store the images and to allow asynchronous acquisition and display operations,
(v)a wirless communication module based on the ZigBee protocol.
A separated pulsed IR projector has been added to the system to demonstrate the system functionality.
The computation unit was described in VHDL and implemented on FPGA XilinX VirtexIIpro (XC2VP30) with 30816 logic cells and 136 hardware multipliers . The synthesis and placement were achieved for 49 parallel processing elements. We use here 28% of the LUTs and 50 hardware multipliers, for a working frequency of 148 Mhz.
To estimate the evolution of the architecture performances, we have used a generic description and repeat the synthesis and placement for different pattern sizes (number of parallel operations). Figure 20 shows that in every case our architecture mapped on an FPGA can work at least at almost 90 Mhz and then obtain a real time constraint of 40 milliseconds.
Performances of the distortion correction.
Regarding size and latency, it is clear that the results are suitable for our application.
Centers computation performances.
Precision versus the size of the stereoscopic base.
Base of 0.5 cm
Base of 1.5 cm
We have used the calibration results to reconstruct the volume of an object (a 20 cm diameter cylinder). The pattern was projected on the scene and the snapshots were taken.
The pattern points were extracted and associated to laser beams using the epipolar constraint. The depth of each point was then calculated using the appropriate model. The texture image was mapped on the reconstructed object and rendered in an VRML player.
Recapitulation of the performances.
Processing block power consumption estimation.
1h 26 min
These two tools use the processing frequency, the number of logic cells, the number of D flip-flop, and the amount of memory of the design to estimate the power consumption. To realise our estimation, we use the results summarised in Table 7. Our estimation is made with an activity rate of 50% that is the worst case.
To validate the power consumption estimation in an embedded context, we consider that a 3V-CR1220 battery (3V-CR1220 is a 3 Volt battery, its diameter is of 1.2 cm, and its thickness is of 2 mm) which has a maximum of 180 mAh power consumption, that is to say an ideal power of 540 mWh. This battery is fully compatible with a VCE like the Pillcam from Given Imaging.
As we can see, the integration of a Virtex in a VCE is impossible because of the SRAM memory that consumes too much energy. If we consider the IGLOO technology based on flash memory, we can observe that its power consumption is compatible with a VCE. Such technology permits four hours of autonomy with only one battery, and twelve hours of autonomy if we used three 3V-CR1220 in the VCE. This result is encouraging because at this time the mean duration of an examination is ten hours.
We have presented in this paper Cyclope, a sensor designed to be a 3D video capsule.
we have explained a method to acquire the images at a 25-frame/s video rate with a discrimination between the texture and the projected pattern. This method uses an energetic approach, a pulsed projector, and an original CMOS image sensor with programmable integration time. Multiple images are taken with different integration times to obtain an image of the pattern which is more energetic than the background texture. Our CMOS imager validates this method.
Also we present a 3D reconstruction processing that allows a precise and real-time reconstruction. This processing which is specifically designed for an integrated sensor and its integration in an FPGA-like device has a low power consumption compatible with a VCE examination.
The method was tested on a large scale demonstrator using an FPGA prototyping board and a pixels CCD sensor. The results show that it is possible to integrate a stereoscopic base which is designed for a integrated sensor and to keep a good precision for a human body exploration.
The next step to this work is the chip level integration of both the image sensor and the pattern projector. Evaluate the power consumption of the pulsed laser projector considering the optical efficiency of the diffraction head.
The presented version of Cyclope is the first step toward the final goal of the project. After this, the goal is to realise a real-time pattern recognition with processing-like support vector machine or neuronal network. The final issue of Cyclope is to be a real smart sensor that can realize a part of a diagnosis inside the body and then increase its fiability.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.