Intelligent and Interactive Systems


703073 Example Papers and Reviews

Stabinger et al. 2016

Hangl et al. 2017

Review 1

Reviewer 3 of IROS 2017 submission 1386

Comments to the author

This paper describes an approach for skill-centric software
testing in robotics. Specifically, the authors propose to
identify the source of error (the erroneous function) from
a set of test runs of skills where each skill consists of
multiple functions. Identification of deviations from
previous skill execution is based on an autoencoder-like
neural network anomaly detection on a time series of sensor
data. Probabilistic inference and selection of the next
skill is based on Bayesian methods with the aim to maximize
information gain with respect to the probability of the
functions being erroneous. Benefits of the proposed method
are said to include goal-directed testing without manually
implementing test cases and the re-use of sensor data
collected during execution or teaching.

The approach suggested by the authors sounds reasonable.
Motivation and language are clear, some relevant related
work is discussed (although the references [13]-[21]
regarding fault detection could be discussed more
distinctly). A case study on an existing robot for
manipulation tasks is presented and illustrates the method.

Specific comments:

It is stated that existing approaches regarding fault
detection are not applicable because they are model-based.
Although no model needs to be manually designed for the
approach proposed by the authors, a measurement observation
model (MOM) is constructed. Some intuition why this kind
of model cannot be used for model-based fault detection
would be helpful for the reader and would strengthen the

Section III.A. could be improved:
- The theoretical background could be presented more
formally. For example, an equation quantifying the
reconstruction error would be helpful.
- It would be interesting to read more about the motivation
to choose the particular network architecture. Besides the
motivation for GRUs, only a few details are given.
- Theory and experiments appear to be mixed a bit here.
What does it mean that the network is trained for 500
epochs in five minutes? Doesn't this depend on the specific
data for the particular application and its dimensionality?
What are the data sets used for training and testing? Also,
the forward reference to Figure 4 is probably not required.
- "After such a network is trained on multiple time series
of successful task executions, we can assume that ..." How
is this assumption verified? This is an important
assumption in the following.
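
As an illustration of the kind of equation meant in the first
point, one common choice (an assumption here, not necessarily
the authors' definition) is the squared error between the
measured sample x_t and its reconstruction \hat{x}_t at time
step t:

```latex
e_t = \lVert x_t - \hat{x}_t \rVert_2^2, \qquad t = 1, \dots, T
```

The symbols x_t, \hat{x}_t, and T are introduced for
illustration only.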

In the simulated experiments (Figure 2), why is the
expected information gain of F_a3 that low? Given that f_1
and f_2 are equally likely to be the source of error, why
is F_a2 prioritized that much compared to F_a3? It would be
helpful to read some discussion about this in Section IV.A.
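
A minimal sketch of how expected information gain can differ
between skills, assuming a single faulty function and a skill
that fails exactly when it uses that function. Both assumptions,
the uniform prior, and the skill compositions are hypothetical
and not taken from the paper.

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {name: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def expected_information_gain(prior, skill_functions):
    """Expected entropy reduction from running one skill.

    Assumes exactly one faulty function and that the skill fails
    if and only if it uses that function (illustrative assumptions).
    """
    p_fail = sum(p for f, p in prior.items() if f in skill_functions)
    gain = entropy(prior)
    # Two possible outcomes: the skill fails (fault is inside it)
    # or succeeds (fault is outside it).
    for outcome_prob, fault_inside in ((p_fail, True), (1.0 - p_fail, False)):
        if outcome_prob <= 0:
            continue
        posterior = {
            f: p / outcome_prob
            for f, p in prior.items()
            if (f in skill_functions) == fault_inside
        }
        gain -= outcome_prob * entropy(posterior)
    return gain


# Hypothetical setup: six functions, uniform prior.
prior = {f"f_{i}": 1.0 / 6.0 for i in range(1, 7)}
gain_three = expected_information_gain(prior, {"f_1", "f_2", "f_3"})
gain_one = expected_information_gain(prior, {"f_1"})
```

Under these assumptions, a skill covering half of the
probability mass is the most informative, while a skill
covering a single function yields less information.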

Also regarding Figure 2, how can the probability of f_3 and
f_6 being erroneous be that low given that no skill
including them has been tested? Is there an assumption
being made in the model that only one function has an
error? This was not clear to me up to this point.
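
The effect asked about here can be reproduced with a minimal
sketch, assuming a single-fault model with a uniform prior. The
function names, the failed skill, and the two likelihood values
are hypothetical and not taken from the paper.

```python
def single_fault_posterior(functions, failed_skill_functions,
                           p_fail_if_used=0.9, p_fail_otherwise=0.05):
    """Posterior over which single function is faulty after one skill failed.

    Assumes exactly one faulty function and a uniform prior. The two
    likelihood parameters are illustrative assumptions.
    """
    prior = 1.0 / len(functions)
    unnormalized = {
        f: prior * (p_fail_if_used if f in failed_skill_functions
                    else p_fail_otherwise)
        for f in functions
    }
    total = sum(unnormalized.values())
    return {f: p / total for f, p in unnormalized.items()}


functions = ["f_1", "f_2", "f_3", "f_4", "f_5", "f_6"]
# Hypothetical observation: a skill using f_1 and f_2 failed.
posterior = single_fault_posterior(functions, {"f_1", "f_2"})
```

Because the posterior must sum to one under the single-fault
assumption, functions that were never exercised (such as f_3
and f_6 in this sketch) lose probability mass as soon as a
failure is attributed to other functions.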

Regarding Figure 4, the term "reconstruction error
likelihood" is implicitly defined in Section III.A., but
its usage is a bit confusing here as it might be associated
with "the likelihood that there is a reconstruction error".
Also, in which unit is this error measured or how is it

Comments on the Video Attachment

The video illustrates the proposed method in an example
Suggestions: It is not clear that there is a bug
causing an offset in planned trajectories. Also, it is not
clear that skill 1 uses planning while skill 2 does

Review 2

Reviewer 8 of IROS 2017 submission 1386

Comments to the author

This paper presents a method for automatic testing of
robotic systems. It relies mainly on two models learned
from demonstrations. First, a measurement observation model
(MOM), which captures the probability of a successful run
given a certain measurement. This model is learned by using
encoder/decoder networks. Second, a distribution of typical
function activations during a run. This model is captured
by simple statistics. Based on these two models, the
authors propose to compute a probability of which function
is buggy given new measurements. 
They show the applicability to robotic applications in a
simulation experiment and a set of real robot experiments.

In general, the paper is well motivated and most of the
ideas are presented in a clear manner. The approach of
learning typical fingerprints of successful runs and using
this information to infer which modules are likely to have
failed in unsuccessful runs is a very interesting idea. In
particular, such approaches could mitigate the tedious work
of writing dedicated tests for complex systems and/or
creating complex simulation models of the system. In my
opinion, the paper has some shortcomings in the
experimental evaluation. The experiments are based on
simulation and on real robots, but a convincing
quantitative evaluation of the method is missing. The paper
describes several 'typical scenarios', but it is not clear
to me how robustly the method works in practice. Since I'm
not an expert in testing frameworks, I cannot evaluate the
novelty of the proposed approach. 

To improve the clarity of the presented method, several
points should be addressed:

The mean and standard deviation of the reconstruction error
are fixed values for each time step. How does that fit with
the claim that the approach can also handle closed-loop
controllers which depend on the current measurement? For
example, in the third experiment, where the robot waits for
an object to be placed in its hand - how can the
probability p(succ | M, t_fail) be evaluated here when the
waiting time varies heavily in different executions?
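
The concern can be illustrated with a minimal sketch that scores
a run by its per-time-step z-score against fixed statistics. The
z-score is used here as a stand-in for the paper's likelihood,
and all error values are made up for illustration.

```python
from statistics import mean, stdev


def per_step_stats(error_runs):
    """Mean and standard deviation of the reconstruction error per time step."""
    steps = list(zip(*error_runs))  # transpose: one tuple per time step
    return [mean(s) for s in steps], [stdev(s) for s in steps]


def max_abs_z(errors, mus, sigmas, eps=1e-6):
    """Largest absolute z-score of one run against the fixed per-step statistics."""
    return max(abs(e - m) / (s + eps) for e, m, s in zip(errors, mus, sigmas))


# Hypothetical successful runs: the error spikes at step 2, when the
# object is placed in the hand.
training = [
    [0.10, 0.11, 0.90, 0.10, 0.10, 0.11],
    [0.11, 0.10, 0.88, 0.11, 0.10, 0.10],
    [0.10, 0.12, 0.92, 0.10, 0.11, 0.10],
]
mus, sigmas = per_step_stats(training)

on_time = [0.10, 0.11, 0.91, 0.10, 0.10, 0.10]
delayed = [0.10, 0.11, 0.10, 0.10, 0.90, 0.10]  # same spike, two steps later

z_on_time = max_abs_z(on_time, mus, sigmas)
z_delayed = max_abs_z(delayed, mus, sigmas)
```

In this sketch the delayed run is also a successful execution,
yet it deviates strongly from the fixed per-step statistics
because the spike arrives at a different time step.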

Equation (9) describes the probability distribution over
the fingerprints F conditioned on the observation M. Again,
the conditioning is 'required in order to include
closed-loop controllers'. However, Eq. (9) indicates that
there is no dependency on M in the learned distribution. Or
is there a different distribution learned for each M? If
this is the case, how can the distribution be evaluated in
the testing phase when encountering measurements that have
never been observed before? In my opinion the paper should
state more clearly how p(F|M,..) is modeled and learned.

In Fig. 4, the positive test samples are very hard to see.
The blue lines hardly differ from the green lines. In the
first plot (a), some of the negative test samples seem to
have a reconstruction error likelihood that is comparable
to the positive ones. Did the algorithm fail in these cases
or was it able to detect the failing modules nevertheless?

In the real-world experiments, a quantitative evaluation
would be nice to see. When performing the same task over and
over (with different buggy modules), how often did the
proposed method infer the failing modules?

Comments on the Video Attachment
courses/2020w/703073/examples.txt · Last modified: 2020/11/17 09:27 by Justus Piater