Ruby On Rails + UTF8 + Mysql

Voici plusieurs tutoriaux complets permettant de faire fonctionner la chaîne de liaison depuis le navigateur jusqu’à Mysql en passant par rails.

Le premier :

In a post about Ruby and Unicode a while back, I mentioned the page at the Rails wiki called How To Use Unicode Strings in Rails. (Btw, check out Why the Lucky Stiff’s response to my post, some useful code there.)

It turns out that there were a few more mostly MySQL-specific steps involved in getting Unicode to work correctly with Rails. So I thought I’d describe all the steps we went through to get it set up in one place. This has only been tested with MySQL 4.1.

In MySQL: Set the Encoding when you Create Tables

You need to explicity tell MySQL that you want your tables to be encoded in UTF-8. Here’s a sample table with 3 columns, id, foo, and bar:

create table samples (
id int not null auto_increment,
foo varchar(100) not null,
bar text not null,
primary key (id)
) Type=MyISAM CHARACTER SET utf8;

The line Type=MyISAM CHARACTER SET utf8; is where the action is. The table type has to be MyISAM, not innoDB, because unfortunately innoDB tables don’t support full text searching of UTF-8 encoded content (details here). Apparently innoDB tables are more flexible in general, but if full-text search is crucial for you, you’ll have to go with MyISAM.

Bummer, that.

In any case, then add the CHARACTER SET utf8 directive, as shown.

(I wonder if there is some way to set this as a default, without adding the line to the DDL for every table?)

Set the “charset” and “Content-type” in the Application Controller

This is also described at How To Use Unicode Strings in Rails.

class ApplicationController < ActionController::Base
before_filter :set_charset

def set_charset
@headers["Content-Type"] = "text/html; charset=utf-8"
end
end

The previous steps were enough to get Unicode showing up in a little test app I generated with scaffold (the app just consisted of an input field and a textarea).

To test it, I pasted in some sample text in various languages. It worked okay for text containing only the characters found in ASCII or latin1, but among other characters there were weird cases of random characters being removed or added. The problem seemed not to have anything to do with the script (i.e., the Unicode block). For instance, in an Esperanto text, “ĉ” (U+0109 LATIN SMALL LETTER C WITH CIRCUMFLEX) came out fine, but “ĝ” (U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX) was borked. Go figure.

It took some digging to find the next bit — thanks to Ben Jackson of INCOMUM Design & Conceito for getting the straight story on the Rails list. The solution is…

Tell Rails to tell MySQL to Use UTF-8. (Got that?)

It came down to a MySQL configuration option: You have to tell MySQL to SET NAMES UTF8, as DHH pointed out in the previous link. You can either do it in the source to ActiveRecord, in mysql_adapter.rb, or you can just make the change in your own application. We chose the latter route.

So, here’s our app/controllers/application.rb as it stands:

class ApplicationController < ActionController::Base
before_filter :set_charset
before_filter :configure_charsets

def set_charset
@headers["Content-Type"] = "text/html; charset=utf-8"
end

def configure_charsets
@response.headers["Content-Type"] = "text/html; charset=utf-8"
# Set connection charset. MySQL 4.0 doesn't support this so it
# will throw an error, MySQL 4.1 needs this
suppress(ActiveRecord::StatementInvalid) do
ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
end
end

(In case you’re wondering, yes, you can have more than one before_filter.)

UPDATE Er, but you can get away with just one: Jonas refactored it. Use this version instead. :)

class ApplicationController < ActionController::Base
before_filter :configure_charsets

def configure_charsets
@response.headers["Content-Type"] = "text/html; charset=utf-8"
# Set connection charset. MySQL 4.0 doesn't support this so it
# will throw an error, MySQL 4.1 needs this
suppress(ActiveRecord::StatementInvalid) do
ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
end
end

UPDATED AGAIN Argh, I forgot to remove the set_charset definition before. It should be correct now…

Anyway, now our Rails installation seems to handle (almost) any nutty writing system that we throw at it.

I’m not sure these are all the Right Way™, I’m just saying it’s worked for us (so far).

For one thing, it seems like it might make sense to just go ahead and set the default character set and collation for MySQL to UTF-8, independently of any Rails stuff at all. Aren’t all character sets supposed to be slouching toward Unicode, anyway? But that sounds like a rather apocalyptic measure, for some reason… I guess I’ll hold off on the rapture of the character sets.

Collation is it’s own topic — but then we’ve not gotten to sorting anything yet. Here’s a recommendation for building the tables with:

CHARACTER SET utf8 COLLATE utf8_general_ci;

To set the collation order. Wading through the MySQL docs on that is next on the agenda.

source

Un autre tuto assez complet également :

Le but de cet article est de synthétiser les paramètres de configuration de rails et de mysql pour n’avoir aucun problème de charset dans la vie d’une application rails.

L’idée est d’utiliser utf-8 comme standard à tous les niveau. Cela peut paraître simple mais ça ne l’est pas trop. Tobias Luetke disait que le plus simple dans rails était de ne rien paramétrer. J’ai essayé (hé oui je sais être paresseux parfois 🙂 ). Mais sans succès.

Du coté de rails version 1.x et supérieur

Dans le fichier config/database.yml

development:
adapter: mysql
database: app_dev
username: myuser
password: myuserpassword
host: localhost
encoding: utf8

Dans le fichier config/environment.rb

ajouter le support(partiel) pour l’unicode

#unicode support
$KCODE = 'u'
require_dependency 'jcode'

Je ne comprends d’ailleurs pas pourquoi ce n’est pas la config par défaut comme c’est déjà le cas pour l’actionmailer.

Dans le fichier app/controller/application.rb

  before_filter :set_charset
def set_charset
suppress(ActiveRecord::StatementInvalid) do
ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end

if request.xhr?
@headers['Content-Type'] = 'text/javascript; charset=utf-8'
else
@headers['Content-Type'] = 'text/html; charset=utf-8'
end
end

Normalement (voir le point sur mysql, le fait de placer encoding: utf8 dans le fichier database.yml devrait suffire. Enfin pas chez moi !:( Suite à cela je rajoute l’instruction ‘SET NAMES UTF8’ directement avant chaque appel au query.

Dans le fichier app/helpers/application_helper.rb

  #for ajax call use this instead of only h()
def ha(html)
h(html).gsub(/([^\x00-\x9f])/u) { |s|
"&#x%x;" % $1.unpack('U')[0] }
end

pour n’avoir pas de soucis sous safari.

Les mails (un bref aperçu)

je crée par défaut une classe de base dont tous les notificateurs vont hériter

class Notifier < ActionMailer::Base
def setup_email
@sent_on    = Time.now
@headers['Content-Type'] = "text/plain; charset=utf-8; format=flowed"
content_type('text/plain; charset=utf-8; format=flowed')
end
def setup_email_html
@sent_on    = Time.now
@headers['Content-Type'] = "text/html; charset=utf-8; format=flowed"
content_type('text/html; charset=utf-8; format=flowed')
end

end

Enfin pour le sujet du mail j’utilise la fonction quote_if_necessary(@subject,”UTF-8”). En effet les différentes directive (RFC) stipulent que le sujet lors d’un transfert SMTP doit être encodé en 7bit donc pas compatible utf8(8bit). Résultat des course de ce sujet hyper-résumé : transformer le sujet en format uuencode ou en quoted-print.

Bon selon les version d’outlook , on peut avoir encore de belles surprises 🙂 mais ça c’est une autre histoire.

Mysql 4.1+

Si la base de données mysql est souvent fournie compilée avec le latin1 comme charset par défaut.

il sufit de rajouter/moidifier le fichier suivant /etc/my.cnf avec

[mysqld]
#Set the default character set.
default-character-set=utf8
#Set the default collation.
default-collation=utf8_general_ci

#character-set-results=utf8
#character-set-client=utf8
#character-set-connection=utf8

Ensuite il faut créer les tables sans préciser le charset ou avec DEFAULT CHARSET=utf8

Les variables suivantes sous mysql définissent son comportment client-server:

character_set_results=latin1
character_set_client=latin1
character_set_connection=latin1

character_set_server=utf8
collation_server=utf8_general_ci

character_set_database=utf8
collation_connection=utf8_general_ci

Les 4 dernières variables définissent le charset et sa collation (la façon dont sont indexé les accents, influence les resultats de recherche de type FULLTEXT) au niveau de la base de données elle-même c-à-dire la façon dont il stocke l’info.

Les 3 premières définissent le charset auquel s’attend le client qui se connecte.

Dans le cadre de mon example je n’ai pas forcé les paramètres par défaut de la db à utf8 car de nombreuses applications PHP sur le serveur s’attendent à ces paramètres par défaut.

Donc si je ne précise rien, le serveur envoie à mon application rails des caractères latin1. Par contre lors d’un update ( avec juste l’encoding placé dans le fichier database.yml), les caractères sont correctement encodé dans la db. D’ou la nécessité de placer la commande SET NAMES UTF8 qui a pour effet de changer ces 3 variables

character-set-results=utf8
character-set-client=utf8
character-set-connection=utf8

ce qui a pour effet de bien conserver de l’utf8.

On pourrait se poser la question de laisser le latin1 par défaut dans la db, car si l’application précise que l’on veut de l’utf8, mysql convertira le latin1 en utf8.

Le problème survient lors de la gestion de la DB.

Les backup (dump) ou l’édition via des outils web comme PHPMyadmin ou la plupart des client GUI vont poser problème.

mysqldump db_to_export >file.sql --default-character-set=utf8 -u myuser -p --add-drop-table -c

mysql db_to_import 

Pour ceux qui veulent je vous propose un script d’adaptation de vos tables latin1 en utf8



# This script has been written to convert utf-8 character stored in latin1 table in mysql Table
# work for mysql 4.1
database = "database"
sqlconn = ActiveRecord::Base.establish_connection(
:adapter => "mysql",
:host => "localhost",
:username => "root",
:password => "",
:database => database
)

#conn.execute("ALTER TABLE contents MODIFY title BINARY(255);")
#=> nil
#>> conn.execute("ALTER TABLE contents MODIFY title varchar(255) character set utf8;")
#=> nil
#>> conn.execute("ALTER TABLE contents MODIFY body BLOB;")
#=> nil
#>> conn.execute("ALTER TABLE contents MODIFY body TEXT CHARACTER SET utf8;")
#=> nil
#>> conn.execute("ALTER TABLE contents MODIFY title BINARY(255);")
#=> nil
#>> conn.execute("ALTER TABLE contents MODIFY title varchar(255) character set utf8;")

# open a connection to the table
conn = ActiveRecord::Base.send(sqlconn.adapter_method,sqlconn.config)
# run the converion on each table
conn.tables.each do |table|
res = conn.execute("show create table #{table}")
str = res.fetch_hash["Create Table"]
#find which table has the latin1 charset
charset = str.scan(/CHARSET=(.*)/).flatten[0]
if charset == "latin1"
message ="table #{table} process "
#convert each varchar and text field
#res = conn.execute("show FULL columns from contents")
table_columns = conn.columns(table)
table_columns.each do |column|
next unless [:text,:string].include?(column.type)
case column.type
when :text
message << column.name + " "
puts "ALTER TABLE #{table} MODIFY #{column.name} blob;"
#Transform in binary blob
res_step1 = conn.execute("ALTER TABLE #{table} MODIFY #{column.name} blob;")
puts "ALTER TABLE #{table} MODIFY #{column.name} text CHARACTER SET utf8;"
#convert the binary into text with the good charset utf8
res_step2 = conn.execute("ALTER TABLE #{table} MODIFY #{column.name} text CHARACTER SET utf8;")

when :string
message << column.name + " "
#puts "ALTER TABLE #{table} MODIFY #{column.name} binary(#{column.limit});"
#Transform in binary string
res_step1 = conn.execute("ALTER TABLE #{table} MODIFY #{column.name} binary(#{column.limit});")
#puts "ALTER TABLE #{table} MODIFY #{column.name} varchar(#{column.limit}) CHARACTER SET utf8;"
#convert the binary into text with the good charset utf8
res_step2 = conn.execute("ALTER TABLE #{table} MODIFY #{column.name} varchar(#{column.limit}) CHARACTER SET utf8;")
end

end
puts message
#now that every field are updated we update the table charset
res_step3 = conn.execute("ALTER TABLE #{table} DEFAULT CHARACTER SET utf8 ;")
end
end
#now that every table are updated we update the table charset
res_step3 = conn.execute("ALTER DATABASE #{database} DEFAULT CHARACTER SET utf8 ;")

#be sure to change the connection
SET GLOBAL character-set-results = 'utf8';
SET GLOBAL character-set-client = 'utf8';
SET GLOBAL character-set-connection = 'utf8';
SET GLOBAL character-set-server = 'utf8';
SET GLOBAL character-set-database = 'utf8';
SET GLOBAL collation-connection = 'utf8_general_ci';

SET GLOBAL collation-server = 'utf8_general_ci';

source

Leave a Reply

Your email address will not be published. Required fields are marked *