Biquipédia:Taberna

Processing Mirandese as Portuguese

There is an error in the configuration of Mirandese-language wiki projects: text on this wiki is treated as if it were in Portuguese. In general, treating one language as another gives poor results. Sometimes it does nothing, and often it does the wrong thing. However, Mirandese and Portuguese may be similar enough that doing so provides a significant benefit. It is difficult to know without being familiar with both languages, so I'm hoping you can give us your opinion. Unless there is a good reason to keep treating Mirandese as Portuguese, the change to stop doing so will begin the week of October 9th. Please let me know of any concerns. You can read more about the overall project (in English) on MediaWiki.org [1].

Obrigado! TJones (WMF) (talk) 14:39, 4 October 2017 (UTC)

Hello Trey Jones, thank you for your message. I agree that treating one language as another gives poor results, but unfortunately not all of the system texts have been translated into Mirandese yet. We are working on it, translating the MediaWiki texts into Mirandese on Translatewiki.net. I think Portuguese is the second language of the mwl.wiki system because native speakers of Mirandese also know and speak Portuguese (in fact, even more than their own language), and because the system is still being translated. The texts and code of mwl.wikipedia need updating: for example, the feminine form of User (Outelizadora) and the neutral form (Outelizador(a)) do not appear, and only the masculine form (Outelizador) is displayed. We also do not yet have the namespaces Book (Lhibro), Education Program (Ansino), and TimedText, and the Portal namespace does not appear in Special:AllPages. I also invite the administrator Alchimista to take part in this discussion. Greetings from the Mirandese and Portuguese communities. Athena in Wonderland (talk) 21:20, 4 October 2017 (UTC)
Hi Athena. Thanks for the quick reply, and thank you for replying in English! I'm sorry that I wasn't clear. We do not want to remove or change the system messages that are in Portuguese. The concern is that the text of articles is processed for searching as if it were Portuguese. So, in the article Porgrama de cumputador, the text, "Un porgrama de cumputador ó porgrama anformático ye ua coleçon de anstruçones que çcriben ua tarefa a ser rializada por un cumputador...." is currently processed by the search engine as Portuguese. It does some things right, and some things wrong. The question is whether Mirandese and Portuguese are close enough to keep processing the text like Portuguese. For example, Portuguese processing discounts um, uma, de, ou, que, and por, but doesn't know to discount un, ua, ó, or la. It doesn't handle d', l', and qu', but neither would turning it off. It does some useful things, like treating porgrama and porgramas the same, so searching for one finds the other. It will handle some inflections but not others. So the question is whether the partially correct processing is worth the inconsistency and the possible occasional errors. TJones (WMF) (talk) 13:15, 5 October 2017 (UTC)
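To make the stop-word and inflection points above concrete, here is a minimal Python sketch. It is not the actual Portuguese analyzer: the stop-word set holds only the six words named in the comment above, and `naive_stem` is a hypothetical stand-in that simply drops a final "s", which is far cruder than a real Portuguese stemmer.

```python
# Illustrative subset only: the six Portuguese stop words named in the thread.
PORTUGUESE_STOP_WORDS = {"um", "uma", "de", "ou", "que", "por"}


def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, trim punctuation, drop stop words."""
    tokens = [t.strip('.,;:"') for t in text.lower().split()]
    return [t for t in tokens if t and t not in stop_words]


def naive_stem(token):
    """Hypothetical stand-in for a stemmer: drop one trailing 's'."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


# Sample sentence quoted in the thread (from the article Porgrama de cumputador).
sample = (
    "Un porgrama de cumputador ó porgrama anformático ye ua coleçon de "
    "anstruçones que çcriben ua tarefa a ser rializada por un cumputador"
)

kept = remove_stop_words(sample, PORTUGUESE_STOP_WORDS)
print(kept)

# Singular and plural collapse to the same form, so one search finds both.
print(naive_stem("porgrama") == naive_stem("porgramas"))
```

In this sketch the Portuguese words de, que, and por are dropped, while the Mirandese un, ua, and ó pass through as if they were content words, which is the inconsistency described above.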
Hi TJones, sorry for the delay in answering. It is a problem that Mirandese text is treated as Portuguese by the search engine, and that other Mirandese words cannot be displayed. What will happen if the Portuguese processing is removed from mwl.wiki? Would it be necessary to create code or a tool for the Mirandese language? And would the results show Mirandese text with the apostrophes omitted from words? Athena in Wonderland (talk) 17:28, 9 October 2017 (UTC)
Hi Athena, without the Portuguese analyzer, mwl.wiki would get the "default" analyzer, which is what most wikis have. It does basic processing; for Mirandese text, that would include lowercasing everything and removing punctuation, though it would keep apostrophes in the middle of words. So 'xx, xx', and "xx" would all match xx. x'x would be one word, but would not match xx. x"x would be split into two words. There would be no stop words, so no words would be omitted from the index, though the math that does the matching discounts the most common words and emphasizes the rarer ones.
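The behavior described above can be approximated in a few lines of Python. This is only a sketch of the described rules (lowercase, drop punctuation, keep word-internal apostrophes), not the real default analyzer; the function name `default_like_tokens` is made up for illustration, and only the straight apostrophe (') is handled.

```python
import re

# A word is a run of letters/digits, optionally continued by
# apostrophe + letters/digits (so only word-internal apostrophes survive).
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*")


def default_like_tokens(text):
    """Lowercase the text and return tokens with outer punctuation removed."""
    return TOKEN_PATTERN.findall(text.lower())


print(default_like_tokens("'xx"))   # leading apostrophe is dropped
print(default_like_tokens("xx'"))   # trailing apostrophe is dropped
print(default_like_tokens('"xx"'))  # quotation marks are dropped
print(default_like_tokens("x'x"))   # internal apostrophe keeps one word
print(default_like_tokens('x"x'))   # a double quote splits the word
```

Because x'x stays a single token, a search for xx would not find it, which is why stripping forms like d', l', and qu' needs extra configuration.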
It's possible to create custom processing for Mirandese, but we don't have the capacity to do it to the same level of detail as Portuguese. However, I could configure some basic processing like stripping d', l', and qu' if I had a full list. I could also probably enable custom stop words if I had a list. Here are lists for Portuguese and English. (Note that | is used as a comment marker, and some words are mentioned but not used.) It's not just a matter of translating the list, but also thinking about similar kinds of words, and whether words have multiple meanings. (So, can in English means both Portuguese poder and lata, but the poder meaning is more common. However, the list I linked to disables it as a stop word, so that we get better matches on the lata meaning. Similarly for may (maio vs. poder again), and others.) I think the English list is more aggressive and from line 207 down adds a lot of other common words. Making an exact list is an art more than a science. Working on such a project would be fun, so I'm up for it, though it wouldn't be my main project, so it would take a few weeks to build and test. TJones (WMF) (talk) 16:23, 10 October 2017 (UTC)
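As a rough illustration of the two ideas above (a stop-word list that uses | as a comment marker, and stripping d', l', and qu'), here is a small Python sketch. Everything in it is an assumption for illustration: the sample list contains only the four Mirandese words mentioned earlier in this thread plus one disabled entry, the function names are invented, and the real configuration would live in the search engine rather than in Python.

```python
import re

# Hypothetical sample list. Text after '|' is a comment, so a word that
# appears only after '|' is "mentioned but not used".
SAMPLE_STOP_LIST = """
 | illustrative sample only, not a vetted Mirandese list
un
ua
ó      | words from the thread
la
 | porgrama   | disabled entry, so it stays searchable
"""

# Prefixes named in the thread; a complete list would still be needed.
ELISION_PREFIXES = ("d'", "l'", "qu'")


def parse_stop_list(text):
    """Return the set of words that appear before any '|' on each line."""
    words = set()
    for line in text.splitlines():
        entry = line.split("|", 1)[0].strip().lower()
        if entry:
            words.add(entry)
    return words


def strip_elision(token):
    """Remove one leading elided prefix such as d', l', or qu'."""
    for prefix in ELISION_PREFIXES:
        if token.startswith(prefix) and len(token) > len(prefix):
            return token[len(prefix):]
    return token


def analyze(text, stop_words):
    """Tokenize, lowercase, strip elisions, then drop stop words."""
    tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
    stripped = [strip_elision(t) for t in tokens]
    return [t for t in stripped if t not in stop_words]


stop_words = parse_stop_list(SAMPLE_STOP_LIST)
print(sorted(stop_words))
print(analyze("Un d'xx ó l'yy", stop_words))
```

Stripping the prefix before the stop-word check means d'xx is indexed as xx, so a search for xx would match it. Whether that is the right behavior for every Mirandese word is a judgment for speakers of the language.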