[request] Self learning: Should be possible to recapture a not found image on the fly and retry

Bug #1408937 reported by RaiMan
Affects: SikuliX
Status: In Progress
Importance: Medium
Assigned to: RaiMan

Bug Description

Hi,

I was thinking of how machine learning can be used to improve the automated scripts and how we can make the script learn.
I automate SAP PC resolutions, and no matter what logic you write, some day an unexpected error will appear.
We cannot control that.
But I was thinking: what if we could write logic so that, when a screenshot for an error is not recognized and an exception occurs:
1) A dialog pops up asking you to take a screenshot of the new error.
2) It saves that screenshot and then asks you to select from some predefined options, like 1) repeat, 2) wait, and so on.
3) The next time the script runs, it won't ask you about this type of error again.

Although I could keep updating the code whenever an error occurs, I would like to implement this so that any non-programmer is also able to easily update the steps and the script.

Please suggest an idea for implementing this if you have one, or any other idea that is actually self-learning.

RaiMan (raimund-hocke)
Changed in sikuli:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → RaiMan (raimund-hocke)
milestone: none → 2.0.0
RaiMan (raimund-hocke) wrote :

The basic approach is already implemented:
setFindFailedResponse() (see docs)

What is missing is the possibility to recapture and then retry.
Additionally, one should have the option to specify optional handlers (pre, instead, post).

The complete image handling will be revised in version 2 and we will then have such a feature.

There is one caveat though:
This only works in situations where the match resulting from the re-capture does not need to be exact in position and size relative to the original image.
If the script uses features such as targetOffset() or calculates positions relative to the match, this might not work with the re-captured image.
Currently I do not have any idea how to solve that.
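
For illustration, a rough script-level sketch of such a recapture-and-retry wrapper; click(), popAsk() and FindFailed are existing Jython API, while recaptureImage() is a hypothetical helper standing in for whatever interactive recapture dialog would be provided (the image name is just an example):

    # Sketch only: retry a click after letting the user recapture the image.
    def clickWithRecapture(imageFile, maxRetries=2):
        for attempt in range(maxRetries + 1):
            try:
                return click(imageFile)    # normal SikuliX find + click
            except FindFailed:
                if attempt == maxRetries:
                    raise                  # give up after the last retry
                if not popAsk(imageFile + " not found. Recapture and retry?"):
                    raise                  # user declined to recapture
                # hypothetical: open an interactive capture dialog and
                # overwrite the stored image file with the new screenshot
                recaptureImage(imageFile)

    clickWithRecapture("error_popup.png")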

anatoly techtonik (techtonik) wrote :

Yes, I was thinking about that some years ago when I first discovered Sikuli. The reason was that I had to train Sikuli for different systems, and sometimes the time needed to write a new script was five times higher than just clicking through the stuff manually for a month. So my automation with "manual learning" (writing scripts) needed five months before I felt the benefit, and, worse, it constantly needed improvements. So I abandoned Sikuli at that point - the pivot to Java was also a reason to abandon it - I don't have time for prototyping in Java.

So, over these years the idea has stayed the same, but now I know a little bit more about the underlying technology that could enable it. I think we need http://caffe.berkeleyvision.org/ or http://deeplearning.net/software/theano/, plus a GPU speed-up. But that's just the technology. The overall scheme:

Sikuli maintains a database (or better said, a matrix) of images that are known to it. Yes, that means a lot of data. Nobody says these have to be images as we use them now. We don't store images in our brain - we store the effect of those images. So the images are just the source data. A request to the database - I will call it the 'matrix' (just to keep everyone confused) - can answer these questions (a rough interface sketch follows the list):

1/ if the current state of the image on the screen is known to it or not
2/ if there are any known areas
3/ if there are any known areas with unusual content
4/ if there are any unknown areas
5/ if there is any unknown state
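
Purely as an illustration of these five questions, a hypothetical Python interface (none of these names exist anywhere yet):

    class Matrix:
        """Hypothetical store of everything the system has learned."""
        def isKnownState(self, screenImage):
            """1/ is the current state of the screen known?"""
        def knownAreas(self, screenImage):
            """2/ which known areas are visible?"""
        def knownAreasWithUnusualContent(self, screenImage):
            """3/ which known areas show unusual content?"""
        def unknownAreas(self, screenImage):
            """4/ which areas are unknown?"""
        def isUnknownState(self, screenImage):
            """5/ is the overall screen state unknown?"""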

So, the idea is that the system SHOULD KNOW ABOUT ALL states. The non-important states can be generalized, but the point is that the system is ALWAYS AWARE of what's going on on the screen, and just chooses whether to react or not.

So, an 'area' is the thing that is specific to us. If we are operating with GUI windows, we need to train the concept of a window. This trained data can then be contained in some 'domain' (a specific part of the matrix), so that Sikuli can count windows without second-guessing their content. So it is a layered concept model, much like in humans.

1. Generic screen - I see something, or I don't
2. Windows - I see windows, I know what they look like
3. Context - I know the windows, I track what is inside them, how many there are, and where they are

The things that can help you understand what I mean are:
 - HMMs (hidden Markov models) and their parameters, as used in speech synthesis/recognition
 - Predator, a self-learning algorithm for object recognition in video

So, we need to decide how to train the system, building a loop:

    [observe] --> [detect] --> [action] --> [learn] --> (repeat)

[observe] is just watching the screen and matching it against the matrix.
[detect] is evaluating the condition of the system - is it 'worried' that
there is something unknown or that something is not a good match.
[action] is taken AUTOMATICALLY when the system is 99.9% sure of
what's going on (e.g. it did that in those conditions 1000 times and
everything was OK), or, if the system is not sure, a person can be ASKED
FOR ADVICE.
[learn] - depending on the outcome of the person's feedback, the new
information about the image, the conditions, and the desired
outcome is saved (associated into the matrix).
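
To make the loop concrete, here is a rough Python sketch of it; every name in it (matrix, captureScreen, askPerson, perform, ...) is hypothetical and only illustrates the scheme above:

    CONFIDENCE = 0.999                               # "99.9% sure"

    def runLoop(matrix):
        while True:
            screen = captureScreen()                 # [observe]
            state, score = matrix.match(screen)      # [detect]
            if score >= CONFIDENCE:
                outcome = perform(state.learnedAction)    # [action] automatic
            else:
                advice = askPerson(screen, state)         # [action] ask for advice
                outcome = perform(advice)
            matrix.associate(screen, state, outcome)      # [learn] store feedback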

Yep. I would do this in Python, because it doesn't require
recompilation ...


RaiMan (raimund-hocke) wrote :

@Anatoly (welcome back ;-)
In principle, I understand what you are saying and your suggested approach.
But I have to admit that this is far beyond what I can invest (in both knowledge and time) in the further development of SikuliX.

SikuliX is first of all a tool for setting up visual workflows that basically do this repeatedly:
wait for some visual to appear on the screen
... and then act on some point on the screen (mouse or keyboard)
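
In a Jython script this pattern is typically just the following (image name and timeout are only examples):

    wait("some_button.png", 10)    # wait up to 10 seconds for the visual to appear
    click("some_button.png")       # then act on it (here: a mouse click)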

... but two things are often complained about:
1. Currently it has no knowledge of what a visual might mean to the human - it is just a bunch of pixels.
2. There is no support for moving a workflow to a different environment that shows the visuals differently on the screen in terms of pixels, or with different timing/animation.

With version 2 I will (besides improving the implementation of the API in many respects) address these 2 areas:
regarding 1: allow adding more information to a visual, so that it can be handled in different situations (other rendering or background, color variants)
regarding 2: allow working with image sets for different environments, and support setting up these different sets based on an existing workflow (a rough script-level sketch of the idea follows)
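
For illustration only: image sets per environment can already be approximated at script level with addImagePath() (an existing Jython API call); the OS check via java.lang.System and the folder names are just examples:

    from java.lang import System as JavaSystem

    # pick an image folder that matches the current environment (example paths)
    if JavaSystem.getProperty("os.name").startswith("Windows"):
        addImagePath("C:/workflows/images_win")
    else:
        addImagePath("/home/user/workflows/images_linux")

    # image lookups now also search the environment-specific folder
    click("login_button.png")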

I think this will increase the usability of SikuliX while staying manageable for me.

Nevertheless: Any ideas or contributions are always welcome.

anatoly techtonik (techtonik) wrote :

Yes, I understand that this is not a one-man-army job, so for this to happen we need to find investors who will invest time.

As for SikuliX features, for now I need its screenshot ability wrapped as a Python module to measure its performance.

RaiMan (raimund-hocke) wrote :

What do you want me to do?

RaiMan (raimund-hocke)
Changed in sikuli:
milestone: 2.0.0 → 2.1.0