Reviewer 3 of IROS 2017 submission 1386

Comments to the author
======================

This paper describes an approach for skill-centric software testing in robotics. Specifically, the authors propose to identify the source of error (the erroneous function) from a set of test runs of skills, where each skill consists of multiple functions. Identification of deviations from previous skill executions is based on autoencoder-like neural-network anomaly detection on time series of sensor data. Probabilistic inference and selection of the next skill to test are based on Bayesian methods, with the aim of maximizing information gain with respect to the probability of the functions being erroneous. Benefits of the proposed method are said to include goal-directed testing without manually implementing test cases and the re-use of sensor data collected during execution or teaching.

The approach suggested by the authors sounds reasonable. Motivation and language are clear, and some relevant related work is discussed (although the references regarding fault detection could be discussed more distinctly). A case study on an existing robot for manipulation tasks is presented and illustrates the method.

Specific comments:

It is stated that existing approaches regarding fault detection are not applicable because they are model-based. Although no model needs to be manually designed for the approach proposed by the authors, a measurement observation model (MOM) is constructed. Some intuition as to why this kind of model cannot be used for model-based fault detection would be helpful for the reader and would strengthen the contributions.

Section III.A. could be improved:
- The theoretical background could be presented more formally. For example, an equation quantifying the reconstruction error would be helpful (one possible form is sketched at the end of these comments).
- It would be interesting to read more about the motivation to choose the particular network architecture. Besides the motivation for GRUs, only a few details are given.
- Theory and experiments appear to be somewhat mixed here. What does it mean that the network is trained for 500 epochs in five minutes? Doesn't this depend on the specific data of the particular application and its dimensionality? What are the data sets used for training and testing? Also, the forward reference to Figure 4 is probably not required here.
- "After such a network is trained on multiple time series of successful task executions, we can assume that ..." How is this assumption verified? It is an important assumption for everything that follows.

In the simulated experiments (Figure 2), why is the expected information gain of F_a3 so low? Given that f_1 and f_2 are equally likely to be the source of error, why is F_a2 prioritized so strongly over F_a3? Some discussion of this in Section IV.A would be helpful. Also regarding Figure 2, how can the probability of f_3 and f_6 being erroneous be so low, given that no skill including them has been tested? Is the model assuming that only one function has an error? If so, this did not become clear to me up to that point (a minimal sketch of this single-fault reading is given after the video comments below).

Regarding Figure 4, the term "reconstruction error likelihood" is implicitly defined in Section III.A., but its usage is a bit confusing here, as it might be read as "the likelihood that there is a reconstruction error". Also, in which unit is this error measured, or how is it calculated?
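To make the first bullet under Section III.A concrete: the reconstruction error and its use in the success likelihood could be stated along the following lines. This is only my reading of the method (combining the reconstruction with the per-timestep statistics of the error); the notation is mine and should be adapted to the paper's:

    e_t = \lVert m_t - \hat{m}_t \rVert_2^2 ,
    \qquad
    p(\mathrm{succ} \mid M) \;\propto\; \prod_{t=1}^{T} \mathcal{N}\bigl(e_t \mid \mu_t, \sigma_t^2\bigr)

Here m_t is the measurement at time step t, \hat{m}_t is its reconstruction by the encoder/decoder network, and \mu_t, \sigma_t are the mean and standard deviation of e_t over the successful training runs. Writing something like this out would also answer the unit question raised for Figure 4 above.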
Comments on the Video Attachment
================================

The video illustrates the proposed method in an example scenario.

Suggestions: It does not become clear from the video that there is a bug causing an offset in the planned trajectories. It also does not become clear that skill 1 uses planning while skill 2 does not.
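Returning to the Figure 2 question above: my reading is that the model assumes exactly one faulty function and that a skill fails if and only if it executes that function. Under this (assumed) deterministic single-fault reading, the posterior over functions and the expected information gain of testing a skill could be computed roughly as in the following Python sketch; the function names, skill composition, and uniform prior are hypothetical, not taken from the paper:

    import math

    FUNCTIONS = ["f1", "f2", "f3", "f4", "f5", "f6"]

    # Hypothetical composition: which functions each skill executes.
    SKILLS = {
        "s1": {"f1", "f2"},
        "s2": {"f1", "f4"},
        "s3": {"f3", "f5", "f6"},
    }

    def posterior(prior, skill, failed):
        """P(faulty = f | outcome), assuming exactly one faulty function
        and that a skill fails iff it executes that function."""
        post = {f: (p if ((f in SKILLS[skill]) == failed) else 0.0)
                for f, p in prior.items()}
        z = sum(post.values())
        return {f: p / z for f, p in post.items()} if z > 0 else post

    def entropy(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def expected_info_gain(prior, skill):
        """H(prior) minus the expected posterior entropy after testing `skill`."""
        p_fail = sum(p for f, p in prior.items() if f in SKILLS[skill])
        gain = entropy(prior)
        for failed, p_out in ((True, p_fail), (False, 1.0 - p_fail)):
            if p_out > 0.0:
                gain -= p_out * entropy(posterior(prior, skill, failed))
        return gain

    prior = {f: 1.0 / len(FUNCTIONS) for f in FUNCTIONS}
    for s in sorted(SKILLS):
        print(s, round(expected_info_gain(prior, s), 3))

Note that under such a single-fault model, one observed failure already drives the posterior probability of every function outside the failing skill towards zero, which would explain the low values for f_3 and f_6 in Figure 2 even though no skill containing them has been tested. If this is indeed the model, it should be stated explicitly.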
Reviewer 8 of IROS 2017 submission 1386

Comments to the author
======================

SUMMARY
=======

This paper presents a method for automatic testing of robotic systems. It relies mainly on two models learned from demonstrations. First, a measurement observation model (MOM), which captures the probability of a successful run given a certain measurement. This model is learned using encoder/decoder networks. Second, a distribution of typical function activations during a run. This model is captured by simple statistics. Based on these two models, the authors propose to compute the probability of each function being buggy given new measurements. They show the applicability to robotic applications in a simulation experiment and a set of real-robot experiments.

REVIEW
======

In general, the paper is well motivated and most of the ideas are presented in a clear manner. The approach of learning typical fingerprints of successful runs and using this information to infer which modules are likely to have failed in unsuccessful runs is a very interesting idea. In particular, such approaches could mitigate the tedious work of writing dedicated tests for complex systems and/or creating complex simulation models of the system.

In my opinion, the paper has some shortcomings in the experimental evaluation. The experiments are based on simulation and on real robots, but a convincing quantitative evaluation of the method is missing. The paper describes several 'typical scenarios', but it is not clear to me how robustly the method works in practice. Since I am not an expert in testing frameworks, I cannot evaluate the novelty of the proposed approach.

To improve the clarity of the presented method, several points should be addressed:

The mean and standard deviation of the reconstruction error are fixed values for each time step. How does that fit the claim that the approach can also handle closed-loop controllers, which depend on the current measurement? For example, in the third experiment, where the robot waits for an object to be placed in its hand: how can the probability p(succ | M, t_fail) be evaluated here when the waiting time varies heavily between executions? (A minimal sketch of this question is given at the end of this review.)

Equation (9) describes the probability distribution over the fingerprints F conditioned on the observation M. Again, the conditioning is 'required in order to include closed-loop controllers'. However, Eq. (9) indicates that there is no dependency on M in the learned distribution. Or is a different distribution learned for each M? If so, how can the distribution be evaluated in the testing phase when encountering measurements that have never been observed before? In my opinion, the paper should state more clearly how p(F | M, ...) is modeled and learned.

In Fig. 4, the positive test samples are very hard to see; the blue lines hardly differ from the green lines. In the first plot (a), some of the negative test samples seem to have a reconstruction error likelihood that is comparable to the positive ones. Did the algorithm fail in these cases, or was it able to detect the failing modules nevertheless?

For the real-world experiments, a quantitative evaluation would be nice to see: performing the same task over and over (with different buggy modules), how often did the proposed method infer the failing modules?

Comments on the Video Attachment
================================
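To make the closed-loop question in the review above concrete: my reading of the reconstruction error likelihood is a product of independent per-timestep Gaussians whose statistics are indexed by absolute time. A minimal Python sketch of that reading (the function name and the Gaussian model are my assumptions, not necessarily the paper's):

    import numpy as np

    def log_p_success(errors, mu, sigma):
        """log p(succ | M) under independent per-timestep Gaussians.

        errors    -- reconstruction errors e_t of a new run, shape (T,)
        mu, sigma -- per-timestep mean/std of e_t over the successful
                     training runs, shape (T_train,) each
        """
        errors = np.asarray(errors, dtype=float)
        if len(errors) != len(mu):
            # Exactly the unclear case: with variable-duration,
            # closed-loop behaviour (e.g. waiting for an object to be
            # placed in the hand), the time indices of the new run no
            # longer line up with the fixed per-timestep statistics.
            raise ValueError("run length differs from training statistics")
        return float(np.sum(-0.5 * ((errors - mu) / sigma) ** 2
                            - np.log(sigma) - 0.5 * np.log(2 * np.pi)))

The same concern applies to Eq. (9): if p(F | M, ...) really is conditioned on M, the paper should state how that conditional distribution is represented, learned, and evaluated for measurements M that were never observed during training.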