[B] encode database as utf8 instead of windows-1252

Bug #394833 reported by Sijing
This bug affects 1 person

Affects   Status      Importance   Assigned to   Milestone
MVHub     Confirmed   Medium       Dan MacNeil

Bug Description

We allow users to enter 'smart quotes' and similar characters that are not valid HTML characters according to w3.org.
We should strip them.

Subroutine to modify:

       mvhub-lib/lib/MVHub/Utilis.pm:_clean_cgi_params

Sijing (sshen)
Changed in mvhub:
assignee: nobody → Sijing (sshen)
importance: Undecided → Medium
milestone: none → 2009-08
status: New → Confirmed
Revision history for this message
Dan MacNeil (omacneil) wrote : Re: [Bug 394833] [NEW] [B] We should strip smart quotes

We should look also at stripping or replacing chars that aren't valid
for us like \n


Revision history for this message
Dan MacNeil (omacneil) wrote : (no subject)

We should also look at the HTML::Template ESCAPE=HTML attribute

Perhaps write a test to be sure it is being used everywhere it can be.

see:
      man HTML::Template
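
A test along these lines could start from a scan of the template files. The following is only a rough sketch: the sub name unescaped_tmpl_vars and the sample template are made up for illustration, and the regex covers the plain and comment-style TMPL_VAR spellings but has not been checked against the real MVHub templates.

```perl
use strict;
use warnings;

# Return every TMPL_VAR tag in $template_text that has no ESCAPE attribute.
# Handles both <TMPL_VAR ...> and <!-- TMPL_VAR ... --> spellings.
sub unescaped_tmpl_vars {
    my ($template_text) = @_;
    my @missing;
    while ( $template_text =~ /(<(?:!--\s*)?TMPL_VAR\b([^>]*)>)/gi ) {
        my ( $tag, $attributes ) = ( $1, $2 );
        push @missing, $tag unless $attributes =~ /\bESCAPE\s*=/i;
    }
    return @missing;
}

# Hypothetical template text, for illustration only
my $sample = <<'END_TEMPLATE';
<h1><TMPL_VAR NAME="agency_name" ESCAPE=HTML></h1>
<p><TMPL_VAR NAME="description"></p>
END_TEMPLATE

# Prints the one tag that lacks ESCAPE
print "$_\n" for unescaped_tmpl_vars($sample);
```

A real test would read each template file and fail if the returned list is non-empty, possibly with a whitelist for variables that intentionally carry markup.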

Sijing (sshen)
summary: - [B] We should strip smart quotes
+ [B] encode database as utf8 instead of windows-1252
Revision history for this message
Sijing (sshen) wrote :

This bug was formerly known as: "[B] We should strip smart quotes"

Scope expanded to fixing all invalid html

steps to make utf8 live

# -3) translate database dump

# -2) verify translation is correct
  charmap.pl # etc etc
  regex for valid utf8

# -1) translate test data files from ***NEW** branch
   # mvh
   # nsp

# 1) load data
     app-mvhub/doc/checklists/drop_and_load_database.txt
     app-mvhub/doc/checklists/setup_sites.txt
# 2) create new branch
   cdws
   bzr branch lp:mvhub/trunk move_db_to_utf8
   mv_set_active # choose new branch
   # copy translated test data files to new branch

# 3) modify app-mvhub/doc/checklists files that create or re-create dbs

  # check for correct value of -E
     man createdb
     firefox google.com # check postgresql docs for possible db encodings

     app-mvhub/doc/checklists/drop_and_load_database.txt
     app-mvhub/doc/checklists/setup_sites.txt

# 4) modify include file that says what char set we use

  # research correct charset name for utf8
  # google terms: html reference utf8 "content="text/html; charset="

  $EDITOR app-mvhub/DocumentRoot/static/all/inc/header.inc

# 5) re-enable test that skips smart quotes
    look in:
 app-mvhub/t/mech/pages.t

     ---possibly mark TODO

# 5a) think about a merge request

# 6) add tests that check if html pages
   containing code points above 127 are valid html
    a. check diff file or other output
      for (2) agency names that contain multi-byte or special utf8
    b. (optional) use http://mvh.sshen.testing123.net to find URLS
    c. add URLS for those agencies to:
       app-mvhub/t/mech/pages.t

    d. ---possibly mark TODO
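
For steps -3 and -2 above, one possible approach is to transcode with the core Encode module and then confirm the result decodes as strict UTF-8. This is a minimal sketch under that assumption, not the script that was actually used; the sub names cp1252_to_utf8 and is_valid_utf8 and the sample string are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a byte string from windows-1252 to UTF-8 bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    return encode( 'UTF-8', decode( 'cp1252', $bytes ) );
}

# True if $bytes is well-formed UTF-8. decode() with a CHECK value may
# modify its argument, so it only ever sees this sub's local copy.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode( 'UTF-8', $bytes, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

# Smart quotes as windows-1252 bytes (0x93 and 0x94)
my $cp1252_line = "\x93Hello\x94";
my $utf8_line   = cp1252_to_utf8($cp1252_line);

printf "before: %s\n", is_valid_utf8($cp1252_line) ? 'valid' : 'invalid';
printf "after:  %s\n", is_valid_utf8($utf8_line)   ? 'valid' : 'invalid';
```

Running the check line by line over the translated dump would point at any line that still holds windows-1252 bytes.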

##############
future work
    modify input scripts to complain if web gives us invalid utf8
    modify output scripts to encode as html entities where needed:
        " ↦ &quot; etc
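
As a rough illustration of the output side, the sketch below replaces every code point above 127 with a numeric character reference using only core Perl. The sub name encode_high_chars and the sample string are assumptions for illustration. It does not touch markup characters such as the double quote; the encode_entities function from HTML::Entities is one existing option that covers those as well.

```perl
use strict;
use warnings;

# Replace every code point above 127 with a numeric character reference.
# Expects a decoded character string, not raw UTF-8 bytes.
sub encode_high_chars {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf '&#%d;', ord $1/ge;
    return $text;
}

# e-acute plus left and right smart quotes
my $name = "Caf\x{E9} \x{201C}Open Door\x{201D}";
print encode_high_chars($name), "\n";
```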

Revision history for this message
Priya Ravindran (priya) wrote :

merge from trunk frequently

man perlreftut

#################
misc useful:
  bzr mkdir
  bzr add

#################

find failing tests
      testfiles:
       app-mvhub/t/mech/pages.t line 278 : $mech->html_lint_ok("Lint: ");

       Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/local/share/perl/5.10.0/HTML/Lint.pm line 107.

       Failed test 'HTML tidy: guide.pl_rm_show_agency_agency_id_100088' at app-mvhub/t/mech/pages.t line 324.
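
The "undecoded UTF-8" warning appears to come from the HTML parser being handed raw bytes instead of decoded characters. The sketch below only illustrates that difference with the core Encode module; whether decoding the page body before linting is the right fix for pages.t is an assumption that still needs checking.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might arrive over HTTP: "Cafe" with e-acute, in UTF-8.
my $raw_bytes = "Caf\xC3\xA9";

# Decoding turns the two-byte sequence into the single character U+00E9.
my $characters = decode( 'UTF-8', $raw_bytes );

printf "bytes: %d, characters: %d\n", length($raw_bytes), length($characters);
```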

understand failing tests
       more ????.t
       man HTML::Lint (probably)
       perl books
       dan ?

###########
add tests info
  app-mvhub/long_runtime_tests/all_pages_valid_html.t

# part of @urls;
http://mvh.$USER.testing123.net/cgi-bin/mvhub/guide.pl?rm=show_agency&agency_id=100088

# part of @urls;
http://mvh.$USER.testing123.net/cgi-bin/mvhub/guide.pl?rm=show_program&program_id=506655

select program_id from program limit 10;
select agency_id from agency limit 10;

# look at get_dbh( $ENV{MV_CONFIG_FILE} )
more lib-mvhub/lib/MVHub/Common

man HTML::Lint
man DBI

###########
  app-mvhub/long_runtime_tests/all_pages_valid_html.t

use strict;
use warnings;

use Test::More;
use MVHub::Common qw/get_dbh/;
use Test::WWW::Mechanize;

# TestHelper must also be loaded, the same way app-mvhub/t/mech/pages.t does it

{    # main
    my $dbh = get_dbh( $ENV{MV_CONFIG_FILE} );

    my @urls = get_urls($dbh);

    # one lint test and one tidy test per URL
    my $number_of_tests = 2 * scalar @urls;
    plan tests => $number_of_tests;

    my $mech = Test::WWW::Mechanize->new();
    $mech->default_header( Authorization => TestHelper::create_temp_auth() );

    urls_lint_ok( $mech, @urls );
    urls_html_tidy_ok( $mech, @urls );
}

# build the list of page URLs from the ids in the database
sub get_urls {
    my $dbh = shift;
    my @urls;

    # ...
    return @urls;
}

sub urls_lint_ok {
    my $mech = shift;
    my @urls = @_;
    for my $url (@urls) {

        # run test;
    }
}

sub urls_html_tidy_ok {
    my $mech = shift;
    my @urls = @_;
    for my $url (@urls) {

        # run test;
    }
}

###########

find how to encode output as html entities
  --find library
      search.cpan.org
      aptitude search
      man -k
  --test library
      save page from website
      run test on page

find places code does output

modify all output code to output valid html (utf8 encoded as html entities)

#################
Email from Dan
The particular bit of your useful knowledge is getting

 SELECT program_id FROM program;
 SELECT agency_id FROM agency;

into an array of URLS like:
 http://mvh.yosi.testing123.net/cgi-bin/mvhub/guide.pl?rm=show_program&program_id=510758
http://nsp.sshen.testing123.net/cgi-bin/mvhub/guide.pl?rm=show_agency&agency_id=103927

Revision history for this message
Dan MacNeil (omacneil) wrote :

current status:

  the test database is translated (in the utf-8_master branch)

   a test has been written that checks all pages for valid html, i.e. are all high ascii chars encoded as HTML entities (HTML::Entities)

cgiapp_postrun can be modified to encode all html as valid entities.

Unfortunately, encoding all html as html entities makes all pages look like "view page source"

We need to figure out how to encode stuff as it comes out of the database and before it is assembled into pages.
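
A possible shape for that, sketched with core Perl only: escape each field of a row right after it is fetched, so the template markup itself is never touched. The sub names escape_field and escape_row and the sample row are invented for illustration, and where in MVHub such a call would belong is still an open question.

```perl
use strict;
use warnings;

# Escape one database value for inclusion in HTML: markup characters
# first, then every code point above 127 as a numeric reference.
sub escape_field {
    my ($value) = @_;
    return $value unless defined $value;
    my %named = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $value =~ s/([&<>"])/$named{$1}/g;
    $value =~ s/([^\x00-\x7F])/sprintf '&#%d;', ord $1/ge;
    return $value;
}

# Escape every field of a row (hash reference) before it reaches a template.
sub escape_row {
    my ($row) = @_;
    my %escaped = map { $_ => escape_field( $row->{$_} ) } keys %{$row};
    return \%escaped;
}

# Hypothetical row, as a fetchrow_hashref-style hash reference
my $row  = { agency_id => 100088, agency_name => "Ni\x{F1}os & Familias" };
my $safe = escape_row($row);
print "$safe->{agency_name}\n";
```

Because only the data is escaped, the assembled page keeps its real tags and should no longer look like "view page source".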

Dan MacNeil (omacneil)
Changed in mvhub:
assignee: Sijing (sshen) → Priya Ravindran (priya)
Dan MacNeil (omacneil)
Changed in mvhub:
assignee: Priya Ravindran (priya) → Dan MacNeil (omacneil)