Help with PicoSearch

How can I control the international language character set used at indexing time?

Character set becomes an important issue only if it's not working for you. Most world languages are going to be indexed just fine by PicoSearch, so you won't even have to think about it. The search results display has been translated into many languages; see the Results Language setting of your account manager, and please feel free to contact us to request a new language.
 
PicoSearch will search all single-byte character set languages. This includes the non-Asian languages, and some Asian languages as well. But Asian languages which have hundreds or thousands of glyphs must use double-byte character sets, and these are supported individually by PicoSearch only as they are developed, including for the concordance results. Check the Alternate Character Options section of the Indexing Topics in your Account Manager for these major choices as they become available.
 
Single-byte non-Western: Usually you'll be fine. If you have a language with a majority of non-Western characters, such as Arabic, Cyrillic, or Hebrew, you may find that the Exact Phrase searching mode works best to prevent extra results (see Any/All/Exact Initializing). Also, if you intend to search characters other than ISO-Latin1 in PDFs or other special filtered formats, be sure to test the searching to your satisfaction first. You can request a trial version of PicoSearch Professional to do this; just contact us.
 

UTF-8 Support: UTF-8 Unicode is increasingly popular with hosters because it can include any language rather transparently. The cost of this is that UTF-8 is not just another single-byte character set; plain Western (ASCII) characters are single byte, but accented and non-Latin characters take two or more bytes each. So it may look good in a browser, but the actual language is less specified, and language-sensitive software like search engines may have more (not fewer) problems.
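To see the varying byte counts for yourself, here is a small Python sketch. The sample characters are only illustrations and have nothing to do with how PicoSearch itself is implemented.

```python
# Count how many bytes each sample character needs when encoded as UTF-8.
samples = ["e", "é", "я", "€", "中"]

byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, count in byte_counts.items():
    print(f"{ch!r} takes {count} byte(s) in UTF-8")
```

Only the plain "e" fits in one byte; every other sample needs two or three.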

For maximum search compatibility, PicoSearch currently supports UTF-8 by automatic conversion to an equivalent single-byte character set. This conversion will be transparent to your searchers and only be used during the search results display, so paying accounts that use UTF-8 on their website should specify PicoSearch's equivalent set in the template's http-equiv meta, or simply leave out that meta entirely so the browser can decide. The default conversion for Western European languages will be the ISO-8859-1 set, so languages like French, German, Spanish, and Italian all work perfectly when the search results are displayed, and the links go back to your site normally. For non-West European languages that aren't handled by ISO-8859-1, such as Russian, Arabic, and Hebrew, if you set your account manager's Results Language display choice before indexing, then UTF-8 will be converted to an appropriate ISO or Windows set. Your account manager's Alternate Character Options section will say what character set was decided upon by PicoSearch. Only the purely double byte languages like Chinese and Japanese won't work even though UTF-8 made them seem as easy as a one byte language.
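The idea of converting UTF-8 to an equivalent single-byte set can be sketched in Python as follows. This is only an illustration of the concept: the helper names `to_single_byte` and `fits` are made up for this example, and PicoSearch's actual conversion code is not shown here.

```python
def to_single_byte(utf8_bytes: bytes, target: str = "iso-8859-1") -> bytes:
    """Hypothetical helper: re-encode UTF-8 bytes into a single-byte set."""
    text = utf8_bytes.decode("utf-8")
    # Raises UnicodeEncodeError if a character has no slot in the target set.
    return text.encode(target)


def fits(utf8_bytes: bytes, target: str) -> bool:
    """Hypothetical helper: can every character be represented in `target`?"""
    try:
        to_single_byte(utf8_bytes, target)
        return True
    except UnicodeEncodeError:
        return False


french = "café".encode("utf-8")                   # 5 bytes as UTF-8
latin1 = to_single_byte(french)                   # 4 bytes as ISO-8859-1

russian = "поиск".encode("utf-8")                 # 10 bytes as UTF-8
cyrillic = to_single_byte(russian, "iso-8859-5")  # 5 bytes as ISO-8859-5

chinese = "中文".encode("utf-8")
print(fits(russian, "iso-8859-1"))   # False: needs a Cyrillic set instead
print(fits(chinese, "iso-8859-1"))   # False: no single-byte equivalent here
```

The French and Russian samples each land in a one-byte-per-character set, while the Chinese sample has no such equivalent, which matches the limitation described above.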
 
Funny characters in the results? If you have some UTF-8 characters mixed into your web pages (and your editor may do it for accents without telling you), then your pages should be declared as UTF-8. If you have no declaration or an incorrect declaration, PicoSearch may conclude that your pages are ISO and you may find some funny characters in the search results. It doesn't help that browsers tend to hide this and many other HTML problems, so your site could be developed for a while before you realize something is actually incorrect. To fix the funny characters, reindex after doing one of the following:
  1. make sure you have the following UTF-8 meta equivalent declaration in the head of your HTML:
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

  2. OR if you stick with ISO for your page then make sure your accented characters are safely unambiguous, either by being in the same ISO set or, best of all, by using the fool-proof HTML character entities like &eacute; for é (the e acute accent).
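Here is a short Python sketch of where the funny characters come from and why the two fixes above work. The sample word is just an illustration; the point is what happens when UTF-8 bytes are read as if they were ISO-8859-1.

```python
import html

page_bytes = "café".encode("utf-8")   # what a UTF-8 editor saves to disk

# Undeclared or wrongly declared page: the bytes are read as ISO-8859-1,
# so the two-byte é turns into two separate characters.
misread = page_bytes.decode("iso-8859-1")
print(misread)   # cafÃ©

# Fix 1: the page is declared (and therefore read) as UTF-8.
correct = page_bytes.decode("utf-8")
print(correct)   # café

# Fix 2: a character entity is plain ASCII, so it reads the same either way.
entity_bytes = "caf&eacute;".encode("ascii")
from_latin1 = html.unescape(entity_bytes.decode("iso-8859-1"))
from_utf8 = html.unescape(entity_bytes.decode("utf-8"))
```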
A note on the Euro sign €: The Euro sign is historically unusual since it is both a recently invented character and very common, so most browsers are forgiving regardless of character set. The best way to display the Euro sign in your HTML is with the named character entity &euro; no matter what character set you're in. If you type &euro; you won't have to worry that technically the Euro sign is missing from ISO-8859-1 and was added in ISO-8859-15, which otherwise mostly copies ISO-8859-1, or that a UTF-8 Euro technically isn't the same bytes as an ISO Euro.
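The differences between the Euro encodings can be checked with a few lines of Python. This is only a sketch using Python's standard codecs, and the variable names are for illustration.

```python
import html

euro = "€"

utf8_euro = euro.encode("utf-8")          # three bytes: e2 82 ac
latin9_euro = euro.encode("iso-8859-15")  # one byte: a4

# ISO-8859-1 has no Euro sign at all, so encoding it fails.
try:
    euro.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

# The ISO-8859-15 Euro byte means something else when read as ISO-8859-1.
misread = latin9_euro.decode("iso-8859-1")   # the generic currency sign ¤

# The entity is plain ASCII and always resolves to the Euro sign.
from_entity = html.unescape("&euro;")
```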
 
Determining Character Set: To correctly break apart your text into all the component words by distinguishing letters from punctuation, PicoSearch needs to decide what your dominant character set is at indexing time. Character set is bigger than language, so don't worry that this means you can't be multi-lingual; it just means that you have to choose the right character set for your languages. Browsers need to know this information too, so if you're designing a non-English, non-West European site then you probably already know which character set to use and where to add it to your web pages. When displaying search results, PicoSearch will insert the charset for the browser in the output HTML page of Free accounts. If you have a paying account, you'll have to put the charset codes that you want in your Customize Template section, since the HTML page design is under your control.
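To illustrate why the character set matters for breaking text into words, here is a small Python sketch. The regular expression is a simple stand-in for a word splitter and is an assumption of this example, not PicoSearch's actual tokenizer.

```python
import re

# Simplified stand-in for a word splitter: runs of letters only.
WORD = re.compile(r"[^\W\d_]+")

# A Russian word stored as four ISO-8859-5 bytes.
raw = "роза".encode("iso-8859-5")

# Interpreted with the right set, all four bytes are letters: one word.
right = WORD.findall(raw.decode("iso-8859-5"))

# Interpreted as ISO-8859-1, the third byte (0xD7) is the multiplication
# sign, which counts as punctuation, so the word is broken into pieces.
wrong = WORD.findall(raw.decode("iso-8859-1"))

print(right)        # ['роза']
print(len(wrong))   # 2
```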
 
PicoSearch will pick up the character set of your site's pages as specified in any one of the following ways. In these examples, ISO-8859-1 is the most common set, which works for English and West European languages. It is the default, but it never hurts to state it explicitly.
  • HTTP Equivalents in the HTML head
    1. <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
      (most common)

    2. <meta http-equiv="charset" content="ISO-8859-1" />
      (less common)

    3. <meta charset="ISO-8859-1" />
      (Internet Explorer only, not recommended)

  • Server specified
    In the encoding field that the HTTP equivalent overrides.
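As a rough sketch of how a crawler might pick up the character set from the declarations listed above, consider the following Python. The function name `detect_charset`, its parameters, and the regular expressions are assumptions made for this example; they are not PicoSearch's actual code, and the patterns only handle the attribute order shown in the examples.

```python
import re
from typing import Optional

META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
# <meta http-equiv="charset" content="...">
EQUIV_CHARSET = re.compile(
    r"http-equiv\s*=\s*[\"']?charset[\"']?[^>]*?content\s*=\s*[\"']?([\w.:-]+)",
    re.IGNORECASE,
)
# charset=... inside a content attribute, or <meta charset="...">
CHARSET_PARAM = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def detect_charset(html_head: str,
                   http_content_type: Optional[str] = None,
                   default: str = "ISO-8859-1") -> str:
    """Hypothetical helper: meta declarations win over the server header."""
    for tag in META_TAG.findall(html_head):
        match = EQUIV_CHARSET.search(tag) or CHARSET_PARAM.search(tag)
        if match:
            return match.group(1)
    if http_content_type:
        match = CHARSET_PARAM.search(http_content_type)
        if match:
            return match.group(1)
    return default


head = '<head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2" /></head>'
print(detect_charset(head))   # ISO-8859-2
```

Checking the page's own declaration first and the server value second mirrors the override order described above; with neither present, the sketch falls back to ISO-8859-1.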
The following list shows the character sets that PicoSearch knows about, provided that the set is specified for your pages as mentioned above. If you are using an unknown or unspecified set, ISO-8859-1 will be used by default. A symptom of PicoSearch not using the right set would be non-Western characters being found individually rather than as parts of whole words. PicoSearch will tell you the character set it used for your index in the Alternate Character Options section of your Account Manager. If you have any problems, please just contact us.
  • Western European, ISO Latin1 (ISO-8859-1)
    most versatile, includes English, Spanish, French, German, Italian, Portuguese, Dutch, Danish, Swedish, Catalan, and more
  • Central European, ISO Latin2 (ISO-8859-2)
    covers the Slavic languages and more, including Czech, Hungarian, Polish, Romanian, and Croatian
  • South European, ISO Latin3 (ISO-8859-3)
    special set for Maltese and some others
  • North European, ISO Latin4 (ISO-8859-4)
    for Estonian and Baltic languages including Lithuanian, Latvian, and Lappish
  • Cyrillic, ISO (ISO-8859-5)
  • Arabic, ISO (ISO-8859-6)
  • Greek, ISO (ISO-8859-7)
  • Hebrew, ISO (ISO-8859-8)
  • Turkish, ISO Latin5 (ISO-8859-9)
  • Nordic, ISO Latin6 (ISO-8859-10)
  • Thai, ISO Latin/Thai (ISO-8859-11)
  • Baltic Rim, ISO Latin7 (ISO-8859-13)
  • Celtic, ISO Latin8 (ISO-8859-14)
  • "Euro" Western European, ISO Latin9 (ISO-8859-15)
  • South-Eastern Europe, ISO Latin10 (ISO-8859-16)

  • Central European, Win Latin2 (windows-1250)
  • Slavic, Windows Cyrillic (windows-1251)
  • Western European, Win Latin1 (windows-1252) compare to ISO Latin1
  • Greek, Windows (windows-1253)
  • Turkish, Win Latin5 (windows-1254)
  • Hebrew, Windows (windows-1255)
  • Arabic, Windows (windows-1256)
  • Baltic, Windows (windows-1257)
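The fallback rule for unknown or unspecified sets can be sketched in Python like this. The `KNOWN_SETS` list is copied from the list above, while the `resolve_charset` helper is hypothetical and only illustrates the rule. The last lines show why Win Latin1 is worth comparing to ISO Latin1: the two differ in the 0x80 to 0x9F byte range.

```python
import codecs
from typing import Optional

ISO_PARTS = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16)
KNOWN_SETS = [f"ISO-8859-{n}" for n in ISO_PARTS] + \
             [f"windows-{n}" for n in range(1250, 1258)]


def resolve_charset(declared: Optional[str]) -> str:
    """Hypothetical helper: return a known set name, else the default."""
    if declared:
        wanted = declared.strip().lower()
        for name in KNOWN_SETS:
            if name.lower() == wanted:
                return name
    return "ISO-8859-1"


# Python's standard codecs happen to recognize every name in the list.
codec_names = [codecs.lookup(name).name for name in KNOWN_SETS]

# Byte 0x93 is a curly opening quote in windows-1252 but an invisible
# control character in ISO-8859-1.
as_win = b"\x93".decode("windows-1252")
as_iso = b"\x93".decode("iso-8859-1")
```

For example, `resolve_charset("Windows-1251")` returns `windows-1251`, while `resolve_charset("Shift_JIS")` and `resolve_charset(None)` both return `ISO-8859-1`.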

