Comment 7 for bug 391623

C de-Avillez (hggdh2) wrote :

The Evolution hook works on the backtraces (BTs): it looks for known variables that (potentially) hold private data and, in any other variable, scans for email addresses, fully-qualified server names, and IP addresses. All matches are replaced by the string '##MASKED##'.

Of course, this will only be fully effective when bug 387933 is resolved for the backoffice.

Meanwhile, the hook seems to be working correctly for the list of Evolution bugs Brian provided me with (BTW, thank you!). The hook currently:

1. Collects Evolution GConf data (the Plugins, Junk Setup, and Prompts subkeys of /apps/evolution); these are added to the report as a [Miscellaneous] string;
2. for each of {Stacktrace, ThreadStacktrace}: scans the lines, and replaces any string value of the following Evolution variables with the string "##MASKED##":
    r'''(key|url_string|url|filename|filesave|uri|profname|user|source|username|password|server|domain|domain_name) # variables in trace
    ([\s]*[=].+?["]) # intermediate text (class, address, etc)
    (.*?) # what we really want: the string data
    (["][, ]*)''' # the delimiter
3. then searches for and replaces any remaining instances of email addresses, fully-qualified server names, and IP addresses (in this order), in any other variables.
4. (Currently) writes a *diff* of the changes made (this creates two *new* entries in reports[]). This was done because we were not sure how invasive the changes would be, and considered it better to just write a diff, at least for now. *Input needed*
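
As an illustration of step 2, here is a minimal sketch of how the variable RE could be applied to one line of a trace. The re.VERBOSE flag, the replacement template, the function name mask_variables, and the sample gdb line are assumptions made for this example; they are not taken from the hook itself.

```python
import re

# The pattern from step 2. It carries inline comments, so this sketch
# assumes it is compiled with re.VERBOSE.
VARIABLE_RE = re.compile(r'''
    (key|url_string|url|filename|filesave|uri|profname|user|source|username|password|server|domain|domain_name) # variables in trace
    ([\s]*[=].+?["])   # intermediate text (class, address, etc)
    (.*?)              # what we really want: the string data
    (["][, ]*)         # the delimiter
''', re.VERBOSE)


def mask_variables(line):
    """Replace the quoted value of a known variable (hypothetical helper)."""
    # Groups 1, 2 and 4 are kept; only group 3 (the string data) is replaced.
    return VARIABLE_RE.sub(r'\g<1>\g<2>##MASKED##\g<4>', line)


# Made-up trace line, for illustration only.
sample = '#3 camel_url_new (url_string=0x8a3f2b0 "imap://jdoe@mail.example.com/", ex=0xbf9d1234)'
print(mask_variables(sample))
# prints: #3 camel_url_new (url_string=0x8a3f2b0 "##MASKED##", ex=0xbf9d1234)
```

A variable with no quoted string after it (for example "password = 0x0") is left untouched, since the RE needs an opening and a closing double quote.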

For both fully-qualified server names (FQSN) and email addresses we use the following RE for domain names:
    '(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|[a-z]{2})'
This RE will match on any of the listed names, or on any two letters.

For IP addresses we use the following RE:
    '([^\d])(\d{1,3}[.]\d{1,3}[.]\d{1,3}[.](\d{1,3}))([^\d])'
This RE will match on *any* dotted sequence of four groups of one to three digits, enclosed in non-digits (for example, "[1.2.3.4]"). It will also match on invalid IP addresses, since no limits are set on the range; for example, it will match on "a912.513.401.12/".
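
A short sketch of what this RE does and does not catch. The function name mask_ips and the decision to keep the enclosing non-digit characters (groups 1 and 4) in the output are assumptions made for the example.

```python
import re

# The IP RE quoted above, written as a raw string.
IP_RE = re.compile(r'([^\d])(\d{1,3}[.]\d{1,3}[.]\d{1,3}[.](\d{1,3}))([^\d])')


def mask_ips(text):
    """Mask dotted quads, keeping the non-digit on each side (hypothetical helper)."""
    # Group 1 is the leading non-digit, group 4 the trailing one.
    return IP_RE.sub(r'\g<1>##MASKED##\g<4>', text)


print(mask_ips('host [1.2.3.4] down'))   # prints: host [##MASKED##] down
print(mask_ips('a912.513.401.12/'))      # prints: a##MASKED##/
print(mask_ips('1.2.3.4'))               # prints: 1.2.3.4
```

The last call shows the effect of the enclosing non-digits: an address at the very start or end of the text has nothing on one side to match against, so it is not replaced.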

For email addresses we use the following RE:
    '[\w\.\-]+@[\w\.\-]+[.]' + DOMAIN_NAMES
This RE will match on word characters (plus '.' and '-'), followed by an at symbol ('@'), more of the same characters, a dot, and one of the DOMAIN_NAMES. This is clearly not fully correct (it would allow, for example, an email address starting with '.'), but it is enough.
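
A minimal sketch of the email RE in use. The function name mask_emails and the sample strings are made up for the example, and no RE flags are assumed.

```python
import re

DOMAIN_NAMES = ('(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|'
                'mobi|museum|name|net|org|pro|tel|travel|[a-z]{2})')

# The email RE quoted above, written as a raw string.
EMAIL_RE = re.compile(r'[\w\.\-]+@[\w\.\-]+[.]' + DOMAIN_NAMES)


def mask_emails(text):
    """Replace every email-like match with the mask (hypothetical helper)."""
    return EMAIL_RE.sub('##MASKED##', text)


print(mask_emails('From: jdoe@mail.example.com'))     # prints: From: ##MASKED##
print(mask_emails('user <jdoe@example.co.uk>, ok'))   # prints: user <##MASKED##>, ok
print(mask_emails('.jdoe@example.org'))               # prints: ##MASKED##
```

The last call is the "not fully correct" case mentioned above: the leading '.' is accepted as part of the address and masked along with it.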

For FQSN we use the following RE:
    '([^\w])([\w.-]*[.]' + DOMAIN_NAMES + ')([^\w\-]|[\n])'
This RE is very similar to the email RE; the differences are that (1) it is prefixed and suffixed with non-word characters, and (2) it has a dot instead of an at symbol.
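
The three REs above can be chained in the order given in step 3 (email, then FQSN, then IP). The sketch below is only an illustration: the helper names mask_fqsn and sanitise, the replacement templates, and the sample text are assumptions, not the hook's code.

```python
import re

DOMAIN_NAMES = ('(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|'
                'mobi|museum|name|net|org|pro|tel|travel|[a-z]{2})')

EMAIL_RE = re.compile(r'[\w\.\-]+@[\w\.\-]+[.]' + DOMAIN_NAMES)
FQSN_RE = re.compile(r'([^\w])([\w.-]*[.]' + DOMAIN_NAMES + r')([^\w\-]|[\n])')
IP_RE = re.compile(r'([^\d])(\d{1,3}[.]\d{1,3}[.]\d{1,3}[.](\d{1,3}))([^\d])')


def mask_fqsn(text):
    """Mask server names, keeping the delimiters around them (hypothetical helper)."""
    # Group 1 is the leading non-word character, group 4 the trailing one;
    # group 3 is the DOMAIN_NAMES group nested inside group 2.
    return FQSN_RE.sub(r'\g<1>##MASKED##\g<4>', text)


def sanitise(text):
    """Apply the three replacements in the order given in step 3."""
    text = EMAIL_RE.sub('##MASKED##', text)
    text = mask_fqsn(text)
    return IP_RE.sub(r'\g<1>##MASKED##\g<4>', text)


print(sanitise('connect jdoe@mail.example.com via imap.example.org (10.0.0.1) failed'))
# prints: connect ##MASKED## via ##MASKED## (##MASKED##) failed

print(mask_fqsn('to jdoe@mail.example.com now'))
# prints: to jdoe@##MASKED## now
```

The second call hints at one plausible reason for the ordering: if the FQSN RE ran before the email RE, the '@' would act as the leading non-word character and only the host part would be masked, leaving the local part of the address behind.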

5. Finally, we currently calculate a diff of the changes to Stacktrace and ThreadStacktrace, and add it to the report as [Stacktrace.diff] and [ThreadStacktrace.diff].
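
For reference, a diff of this kind can be produced with Python's standard difflib module. The sketch below assumes a unified diff and treats the report as a plain dict; the function name make_diff, the '.sanitised' label, and the sample trace are made up for the example, and the hook may well do this differently.

```python
import difflib


def make_diff(original, sanitised, name):
    """Return a unified diff between two versions of a trace (hypothetical helper)."""
    return ''.join(difflib.unified_diff(
        original.splitlines(keepends=True),
        sanitised.splitlines(keepends=True),
        fromfile=name,
        tofile=name + '.sanitised'))


# Stand-in for the report: a plain dict keyed by field name (assumption).
report = {'Stacktrace': 'frame 0\nurl = 0x1 "imap://host/"\n'}
sanitised = 'frame 0\nurl = 0x1 "##MASKED##"\n'

# The original entry is kept; the diff is added as a new entry.
report['Stacktrace.diff'] = make_diff(report['Stacktrace'], sanitised, 'Stacktrace')
print(report['Stacktrace.diff'])
```

If nothing was masked, the two texts are identical and make_diff returns an empty string, so the .diff entry could be skipped in that case.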

6. and exit.

Additional comments:

(a) although the idea is to provide a sanitised stacktrace in order to allow the bug to be classified Public, I was reluctant to delete the original stacktraces: not only may I be missing something, but there *might* also be a case where the original (masked-out) value would be needed for a full understanding of the issue. This is why we decided to *add* a diff for the changes -- a sanitised stacktrace can then easily be obtained by patching the corresponding stacktrace with its diff. Another option would be to provide the sanitised stacktrace (removing the original) and the, er, reverse diff, in order to get the original one.

(b) option (a) would be, in my view, the ideal scenario, but we would depend on bug 151658 to make attachments and comments private.

I have run this hook against 753 bugs from Brian's list, and it *seems* to be working correctly. The runs were executed by calling the hook with the --report parameter; as currently coded, only the .diffs are printed out.

TO BE DECIDED:

1. should we delete the original traces, and maintain only the sanitised traces (and, perhaps, a reverse diff)?
2. should we save the original traces, and the diffs?

Note that these two options will not allow for the bug to be marked public.

3. should we save *only* the sanitised traces, and mark the bug public?

I will provide test data, based on the runs I have.