
ANN Kanji search (Google) works unlike cast crediting dialog?


Devil Doll



Joined: 07 Jul 2007
Posts: 656
Location: Germany
Posted: Fri Mar 20, 2009 10:51 pm
I tried to add the credits for a person named 吉田麻子. The interface told me that there was no person with this name.

Just to be 100% sure, I then did an ANN search for "吉田 麻子" which - to my surprise - actually produced one hit, Asako YOSHIDA. This person was a cast member with only two minor roles, but from the same year as the role I was about to credit to her. I therefore entered the Kanji for her family name / given name in a separate browser tab, and was then able to re-verify the cast input without changing anything in my input form.

So it looks like the ANN-internal Google search is somehow brighter than the (still clever) logic for adding cast credits (which does a lot of checking by itself, such as finding apparently wrong orders of family name / given name and even some romanization variants), as it appears to have done separate lookups for both partial names in Kanji and then made the assumption that I actually wanted to search for "Asako Yoshida".

Would it be reasonable (and possible) to use that logic directly within the cast/staff input form? To be precise: If a Kanji input for a person's name doesn't produce a hit, then do separate lookups for both name parts (assuming there were a reasonable logic to split the Kanji string apart - I did that manually with the help of EDICT, but in the case of 4 Kanji a 2:2 split is >90% correct... actually the ANN database could show you more exact numbers than I could guess here), and if both of them produce unique results for their romanization, then (and only then) do another lookup for the combination of these romanizations? This procedure might have found Asako Yoshida directly within the crediting form.
Of course such logic should present its result with a grain of salt, i.e. with an appropriate warning. But one of its advantages would be that I'd see the link to this person before changing anything about my input, and could inspect this person (and perhaps even add the Kanji for him/her before proceeding with my input form).
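
A minimal sketch of the lookup proposed above, assuming the per-name Kanji readings and the existing persons can be queried as simple mappings. The table names, the sample data and the person id 1001 are hypothetical placeholders, not ANN's actual schema. It tries every split position and only returns a suggestion when both name parts have exactly one known romanization and that combination already exists as a person.

```python
from typing import Optional, Tuple

# Hypothetical in-memory stand-ins for database lookups (illustrative data only).
FAMILY_READINGS = {"吉田": {"Yoshida"}}           # family-name Kanji -> known romanizations
GIVEN_READINGS = {"麻子": {"Asako"}}              # given-name Kanji -> known romanizations
PERSONS_BY_ROMAJI = {("Yoshida", "Asako"): 1001}  # (family, given) -> hypothetical person id


def suggest_person(kanji_name: str) -> Optional[Tuple[str, str, int]]:
    """Return (family, given, person_id) if exactly one split yields a unique match."""
    candidates = []
    for i in range(1, len(kanji_name)):
        family_kanji, given_kanji = kanji_name[:i], kanji_name[i:]
        family = FAMILY_READINGS.get(family_kanji, set())
        given = GIVEN_READINGS.get(given_kanji, set())
        # Proceed only when both parts have exactly one known romanization.
        if len(family) == 1 and len(given) == 1:
            f, g = next(iter(family)), next(iter(given))
            person_id = PERSONS_BY_ROMAJI.get((f, g))
            if person_id is not None:
                candidates.append((f, g, person_id))
    # An ambiguous or empty result yields no suggestion at all.
    return candidates[0] if len(candidates) == 1 else None
```

With the sample data, `suggest_person("吉田麻子")` points at the existing Asako Yoshida entry, while a name whose given-name Kanji is unknown produces no suggestion. The result would still only be shown as a hint with a warning, never applied automatically.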

Just to check, I removed the Kanji given name for that person and tried to repeat my input in the crediting form, but this time with the Kanji already correctly split between family name and given name. Again, the Kanji produced no hit (understandable), so it's not the splitting that was the problem here, it's the additional logic that the Google-style search procedure apparently shows.
dormcat
Encyclopedia Editor


Joined: 08 Dec 2003
Posts: 9902
Location: New Taipei City, Taiwan, ROC
Posted: Sat Mar 21, 2009 4:07 am
To put it briefly, the Encyclopedia's search engine limits the query to listed items, i.e. main and alternative titles of anime and manga plus English/romaji names of people and companies. Items within an entry (kanji names, skills and hobbies, company addresses, trivia) are not searchable at the moment. Google, on the other hand, searches EVERYTHING on a page, including kanji names.
Devil Doll



Posted: Sat Mar 21, 2009 9:21 am
But the page that the ANN specific Google search found didn't have any Kanji names. It was me who added them only afterwards.

To prove my point, I removed the Kanji name entries from Asako YOSHIDA today. But even now a search for "吉田 麻子" produces this page as the only hit even though it has no Kanji at all.
And yes, I'm aware that this page might be indexed somehow and not be fulltext-searched with its current content, so that my recent change might not be significant until that index were to be updated. But my question remains: How did the ANN specific Google search produce this hit before I entered the Kanji for that person for the first time?
Dan42
Chief Encyclopedist


Joined: 02 Jan 2002
Posts: 3794
Location: Montreal
Posted: Wed Mar 25, 2009 9:51 am
Google is doing some pretty crazy stuff there. But honestly I don't think it's worth the trouble of doing this automated kanji-to-romaji conversion. As you pointed out yourself, we could only do this conversion if a lookup for the kanji turns up one and only one possible romanization. But even then the romanization is merely likely. You'd still have to do some manual sleuthing in order to confirm the romanization. Otherwise we could end up with a feedback loop of errors. Imagine if someone mistakenly enters 吉田太郎 as Taro Yoshita. Then, when adding a credit for 吉田麻子 you would find it auto-converted to "Asako Yoshita" and, blindly trusting in the system, create a new entry for Asako Yoshita when in fact the credit should have been filed under the existing Asako Yoshida. From there on, every single 吉田 would be translated to Yoshita. Not good.

Even assuming that we only use this to look up existing names, that doesn't offer much of an advantage; why duplicate effort if a simple Google search will accomplish the exact same thing?
Devil Doll



Posted: Wed Mar 25, 2009 12:02 pm
Perhaps because that Google lookup isn't available from within the credits contribution form, and it would still require the user to correctly split the Kanji name into family name and given name (which his Japanese source might not have done). The ANN software could do that by simply trying all available split positions and checking the results; that would certainly be better than, say, selecting the first entry of the list of names for these Kanji from EDICT, and it would allow Kanji-only sources to be reasonably evaluated for anime where Roumaji sources are rare or non-existent. (Contributing is meant to be a rare event compared to displaying data, so a few additional SQL queries shouldn't be too expensive when they would lead to better data quality; you're doing the same for potential Roumaji duplicates already.) And yes, I can do the individual lookups for Kanji using the Google search myself. But doesn't that defeat the purpose of the wonderful multi-line input mode for adding credits?
The "cascading effect" of potentially duplicating errors might be countered by considering a Kanji-to-Roumaji guess "reasonable" only if 100% of the matching entries in the ANN database agree on it (which is very likely according to my experience - the only exceptions I remember are given names like Akira/Kenji and the first syllable of a family name being "o" or "ko") and there are at least, say, 5-10 such hits in the database. This might still be a good feature in the case of many popular family names in combination with a rare given name or vice versa. And remember that it would still be a suggestion only, much like you give a warning for different Roumaji writings of the potentially same person. But you're certainly right to consider this a "nice to have" thingy at best, compared to other features.
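
The trust rule described above could be sketched roughly as follows, assuming the romanizations already stored for one Kanji name part are available as a plain list. The function name and the default threshold of 5 are assumptions taken from the "5-10 hits" suggestion, not existing ANN code.

```python
from collections import Counter
from typing import Iterable, Optional

MIN_HITS = 5  # assumed threshold; the post suggests "5-10"


def trusted_reading(readings: Iterable[str], min_hits: int = MIN_HITS) -> Optional[str]:
    """Return the romanization only if all entries agree and there are enough of them."""
    counts = Counter(readings)
    if len(counts) != 1:
        # Either no entries at all or at least two competing romanizations.
        return None
    reading, hits = next(iter(counts.items()))
    return reading if hits >= min_hits else None
```

A single stray "Yoshita" among many "Yoshida" entries is enough to suppress the suggestion, which is the point: the guess is only offered while the database is unanimous.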

My general impression is that the current GUI for contributing credits is unfriendly to Kanji input in a number of ways. Such as: Kanji names of persons not found in the database lead to this person being rejected, Roumaji names of persons not found in the database lead to this person being created (regardless of potential translation errors - so Roumaji are more trustworthy than Kanji?) while the contributor isn't even given a chance to submit the Kanji name as well. Neither in the feedback message ("person ... has been created") nor on the Anime page where the credit was given can you even find a link to that person in order to edit him/her (as this person has one credit only at this stage), and the ANN Google search won't find the new person either because the index needs to be refreshed first... apparently the shortest way to enter Kanji for that new person's names is to start an error report (!) for the credit, as that's the page where a link to the person is always visible. (Yes, the page of all persons whose name starts with the same letter would be another way but that would put a lot of unnecessary load on the database.)
Despite the severe warning by dormcat to be very aware of what you're doing while creating new persons, the GUI willingly accepts any rubbish as Roumaji name (despite your plausibility checks) but at the same time prevents people who can provide additional information (such as Kanji names, thus backing up their Roumaji and allowing for an instantaneous plausibility check by the way - it's not like people providing Kanji data couldn't make typos) from working efficiently. Shouldn't it be the other way round? (Such as offering a link to a separate "create new person" dialog where in another browser tab at least the Kanji names could be given as well, and plausibility checks could be done before actually creating this person. If you're afraid of scaring English-only contributors away by this feature then do this only after you've detected Kanji input already, or maybe depending on a parameter in the configuration of that ANN user - but that would be the super luxury implementation. Or you could simply put a link to that explicit creation process form on the name of the person to be created - which you give a feedback message for already before submitting the form - and the Roumaji-only contributor can happily ignore this link and proceed as usual.)

I wish the "edit person" module would do a database lookup for uniqueness of the Kanji name combination whenever a Kanji family/given name is being entered, thus making the contributor aware of a duplicate person being created in that very moment. This would certainly lead to a number of duplicates being found earlier, and by people who might be able to provide Kanji sources to help solve the issue. I might have missed a number of such duplicates while adding Kanji names for already existing persons because the only way of reliably finding them would have been to fake giving them another credit (which I didn't do as the credit I was inspecting was already there, and it would have been a lot of additional work).
By the way, do Encyclopedists internally have some maintenance tool/script that would list all persons with non-unique Kanji names (grouped by Kanji name), and do they have a way of tagging some of them as "this isn't an error"?
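
Such a report could look roughly like this, assuming persons can be exported as (id, family Kanji, given Kanji) rows. The row layout and the `confirmed_ok` set, which stands in for the "this isn't an error" tag, are hypothetical.

```python
from collections import defaultdict


def duplicate_kanji_groups(persons, confirmed_ok=frozenset()):
    """Group person ids by full Kanji name and keep only names used more than once.

    persons: iterable of (person_id, family_kanji, given_kanji) rows.
    confirmed_ok: Kanji names already reviewed and tagged as genuine namesakes.
    """
    groups = defaultdict(list)
    for person_id, family_kanji, given_kanji in persons:
        # Persons without a complete Kanji name cannot be compared this way.
        if family_kanji and given_kanji:
            groups[(family_kanji, given_kanji)].append(person_id)
    return {
        name: ids
        for name, ids in groups.items()
        if len(ids) > 1 and name not in confirmed_ok
    }
```

The same grouping could run at edit time for a single name, so that a contributor entering Kanji for one person is told immediately that another person already carries them.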

I would dream of the ANN database having Hiragana as an additional third name of Japanese persons, and automatically generating Roumaji names from these (so that the decision which Roumaji to be used at ANN could be changed within that conversion code instead of manually renaming 75000 persons). The users would still provide arbitrary Roumaji input but in most cases it would be possible to translate these back to Hiragana, and thus find Romanization duplicates more reliably (and give a warning to the user if that translation wouldn't work, thus pointing out potential spelling errors in the Roumaji name already). I like the existing checks for certain similar names (finding SATO/SATOU and the like) but I believe a systematic backwards translation to Hiragana would be a better method than checking for an arbitrary list of letter combinations. (You would then at least know that you didn't miss any case you could have handled, and you can't try to find spelling errors without any such Hiragana table.)
Despite being filled automatically during the creation process of the person the Hiragana name fields would be editable like the Kanji name fields, so the existence of a source (Japanese Wikipedia, Allcinema etc.) would allow to tell apart auto-generated from validated content.
I'm aware that certain Japanese people insist on a particular Romanization so the Roumaji name fields would never become obsolete, but only if these fields were explicitly filled (i. e. after the person already existed, and with a different content from the name auto-romanized from the Hiragana, or with some checkbox "this Romanization used explicitly by the person himself" being checked) this name would need to be used. I'm also aware that any such additional feature would only work for Japanese persons, thus English persons would have to be treated differently. And of course I don't expect you to change the database format because of a mere forum posting... I'm just reporting my experience from contributing a few hundred persons' Kanji names here, and I have no clue about the percentage of Japanese persons at ANN that actually have a Roumaji name intentionally differing from the auto-romanized name from their Hiragana, or the percentage of Japanese people with fancy/English/whatever given names where the backward translation to Hiragana would fail. My wild guess is that it would work for 90% of the existing persons.
Dan42
Posted: Wed Mar 25, 2009 6:46 pm
Thanks for your valuable input. You make several good points and I'll try to see how I can incorporate them into the Encyclopedia.

Devil Doll wrote:
By the way, do Encyclopedists internally have some maintenance tool/script that would list all persons with non-unique Kanji names (grouped by Kanji name), and do they have a way of tagging some of them as "this isn't an error"?

We don't have such a report, but we do have a way of tagging information as True/False.

Devil Doll wrote:
The users would still provide arbitrary Roumaji input but in most cases it would be possible to translate these back to Hiragana, and thus find Romanization duplicates more reliably (and give a warning to the user if that translation wouldn't work, thus pointing out potential spelling errors in the Roumaji name already). I like the existing checks for certain similar names (finding SATO/SATOU and the like) but I believe a systematic backwards translation to Hiragana would be a better method than checking for an arbitrary list of letter combinations. (You would then at least know that you didn't miss any case you could have handled, and you can't try to find spelling errors without any such Hiragana table.)

There are too many ambiguities to systematically convert romaji to hiragana. Sato might be written as さとう or さと, Shinichi as しんいち or しにち. So it doesn't offer any advantages over the current normalization-denormalization lookup process ( "Satou" -> Sat{o} -> /Sat(o[ouh]?|ô|ō)/ ). A hiragana-based database might make things more accurate for Japanese names, but there are many non-Japanese names, and this is simply no longer the time for turning the whole dataset on its head.
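
To illustrate the normalization-denormalization idea with the "Satou" example, here is a minimal sketch. It is not ANN's actual code: it only handles the "o" variants named above, and the function names are made up for this example.

```python
import re

# Spellings treated as variants of one "o" sound; longer forms first so "ou" wins over "o".
O_VARIANTS = re.compile(r"ou|oo|oh|ô|ō|o", re.IGNORECASE)
O_PATTERN = "(?:o[ouh]?|ô|ō)"


def normalize(name: str) -> str:
    """Collapse every 'o' variant into the placeholder '{o}', e.g. 'Satou' -> 'Sat{o}'."""
    return O_VARIANTS.sub("{o}", name)


def to_regex(name: str) -> re.Pattern:
    """Expand the normalized name into a regex matching all spelling variants."""
    parts = normalize(name).split("{o}")
    return re.compile(O_PATTERN.join(re.escape(p) for p in parts), re.IGNORECASE)
```

For instance, `to_regex("Satou")` matches "Sato", "Satoh" and "Satō" but not "Saito". The approach deliberately over-matches (it cannot tell a long vowel from two separate syllables), so its hits are candidates for a warning rather than proven duplicates.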
Devil Doll



Posted: Thu Mar 26, 2009 12:43 am
I'm aware of the problems in automatic Roumaji to Hiragana conversion - the trivial solution would not be perfect. Then again, it would not be worse than the Roumaji input of "Shinichi" that wouldn't give you the required information either as of now.
I think these are only a few situations (certain "o"/"ou" situations and the "n" being used as vowel, exactly those you named and fewer than your current mapping between different Romanizations). For example, doing a search for "にち" as substring on EDICT tells me that while there are in fact a number of names that contain these Hiragana all of these begin with them. Which makes me believe that "nichi" should always be romanized to "n'ichi" if (and only if) it's not the beginning of a name. This might even be helpful for your already existing Roumaji-based similarity logic. (Sato/Satou is a more difficult issue that can only be handled with Kanji knowledge available, i. e. on a different level of consistency checking.)
Many aspects of those checks are possible with Roumaji alone. But as long as ANN doesn't use one Romanization method consistently (and I'm afraid ANN will never be able to do that because of certain persons using a particular romanization for their own name intentionally, so that any particular romanization would always be "wrong" for certain persons) this will always cause additional problems that Hiragana would not have (as they can be added with their correct values regardless of how people think their name should be written outside of Japan). That's part of the beauty of the Hiragana solution: It could remain an internal-only key.
And yes, I can see how you would not like to add two columns to an existing database table with 75000 entries and then double-check each and every SQL statement accessing this table...but imagine the Hiragana being computed only temporarily by the error hunting script (as a way of normalizing different Romanizations for better comparison). No need to change the database tables permanently if you don't want more than this. (Those "temporary Hiragana" wouldn't even require actual Hiragana as character set, a concatenation of Roumaji of one particular standard for each syllable, plus inserted unique syllable separators, would suffice. "sa-to-u", "shi-n-i-chi", that's all you need for string equality comparison.)
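
A rough sketch of such a temporary syllable key, assuming plain ASCII Roumaji input. The regular expression is a simplification written for this example: it does not handle doubled consonants or macron vowels and raises an error for them, which could double as the "this doesn't look like valid Roumaji" warning.

```python
import re

# One unit: an optional consonant onset plus a vowel, or a syllabic "n"
# (optionally marked with an apostrophe, as in "Shin'ichi").
_UNIT = re.compile(
    r"(?:sh|ch|ts|[kgnhbpmr]y|[kgsztdnhbpmyrwfj])?[aiueo]"
    r"|n'?"
)


def syllable_key(romaji: str) -> str:
    """Split Roumaji into syllable units joined by '-', e.g. 'Satou' -> 'sa-to-u'."""
    text = romaji.lower()
    pos, units = 0, []
    while pos < len(text):
        match = _UNIT.match(text, pos)
        if match is None:
            raise ValueError(f"cannot segment {romaji!r} at position {pos}")
        units.append(match.group().rstrip("'"))
        pos = match.end()
    return "-".join(units)
```

Here "Satou" becomes "sa-to-u" and "Shin'ichi" becomes "shi-n-i-chi", whereas an unmarked "Shinichi" comes out as "shi-ni-chi". The key makes that ambiguity visible but cannot resolve it without Kanji or a marked spelling.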
By the way, consistency checks wouldn't necessarily have to run on the live data tables. You could create a copy of the persons table for this purpose, add the Hiragana columns, run the script to auto-fill them, do the checks and save the person_id numbers for fixing the issues in the real table later. No need to change existing SQL code then. You might consider this kind of a maintenance routine.

I even wondered whether a simple additional CGI parameter like animenewsnetwork.com/encyclopedia/people.php?ln=Z&kanji=yes (displaying the Kanji name in brackets behind each Roumaji name) would be helpful for error hunters (do that on the "S" page to tell apart those many SATO/SATOU, and optionally even use the Kanji as alternative sort key triggered by another parameter - the exact order isn't important, just group identical Kanji together in order to see diverging romanizations).
These kinds of tools could lead to a better treatment of those cases where the current Romanizations are not only ambiguous but actually misleading (Sato/Satou, Ito/Itou etc. being the most prominent cases), one permanent source of problems for future input based on Roumaji alone. These aren't too many names, and they could even be handled systematically based on their Kanji names. Such as: If a person has the same Kanji name as another person with a different Hiragana name then one of these most likely will be wrong. (Doing that separately for family names and given names might even increase the filter quality while missing a few wrong cases.) A script could find these situations; certain Kanji names would always have ambiguous Hiragana/Roumaji equivalents (Akira etc.), so these would go on a blacklist of that filter script to keep the list short while Shin'ichi/Shinichi errors would be found automagically once you have at least one correct entry with the same Kanji name, just like SATO/SATOU errors with the same Kanji (as these would get different Hiragana).
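
The filter script described here might be sketched like this, assuming name parts are available as (Kanji, Roumaji) pairs. The blacklist entry is an illustrative placeholder, and the function name is hypothetical.

```python
from collections import defaultdict

# Kanji known to have several legitimate readings; illustrative entry only.
AMBIGUOUS_KANJI = frozenset({"明"})


def diverging_romanizations(names, blacklist=AMBIGUOUS_KANJI):
    """Report Kanji name parts that are stored with more than one romanization.

    names: iterable of (kanji, romaji) pairs, e.g. all family names in the database.
    """
    readings = defaultdict(set)
    for kanji, romaji in names:
        # Skip persons without Kanji and Kanji that are expected to diverge.
        if kanji and kanji not in blacklist:
            readings[kanji].add(romaji)
    return {kanji: sorted(found) for kanji, found in readings.items() if len(found) > 1}
```

Each reported Kanji still needs a human decision on which romanization is the wrong one; the script only narrows down where to look.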
By the way, any consistency check using Kanji will happily ignore English names, simply because these persons won't have a Kanji name. Hiragana-based checks are a different issue... but if you're afraid of handling English names by this then just ignore the Hiragana of all persons who have at least one credit in a language that's not Japanese. Automatic error detection doesn't have to be perfect in order to be helpful. (Actually I'm more afraid of the encyclopedists still being shocked by the first results of such a script and believing they don't have the manpower to handle those errors... just look at the duplicate persons and duplicate companies thread. Finding issues is one thing, fixing them is another one.)

Just one more idea, as it fits into this discussion: There must already have been many contributors who were prevented from doing the best possible input (see GUI discussion above), as they had a Kanji source available (they entered it as source for their credit contribution!) but still didn't create the Kanji names for the corresponding person. But it's never too late to fix that: Let dormcat specify a list of domains of known good Kanji information (D2_STATION, Knight, Shirayuki, Japanese Wikipedia, jmdb, maybe even Allcinema that at least has Hiragana for many persons; perhaps create a sticky discussion thread for suggestions of those "known good sources"), and then compute a list of all persons with missing Kanji name but a given source for at least one of their credits that matches an entry of this domain whitelist. Normal contributors could then work with this list to supply those missing Kanji names (which is much easier than checking which of those many sources would have Kanji data for a particular anime and finding the persons with missing Kanji names), thus increasing the overall quality of the database content while encyclopedists would focus on fixing the real problems.
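
As a sketch of that list computation, assume each credit stores its source as a URL. The whitelist entries below are placeholders (the real list would come from dormcat or the suggested sticky thread), and the data layout is hypothetical.

```python
from urllib.parse import urlparse

# Placeholder whitelist of hosts considered good Kanji sources.
KNOWN_GOOD_DOMAINS = frozenset({"ja.wikipedia.org", "kanji-source.example.org"})


def persons_missing_kanji(persons, credit_sources, whitelist=KNOWN_GOOD_DOMAINS):
    """List persons without a Kanji name who have a credit sourced from a whitelisted host.

    persons: mapping of person_id -> Kanji name (empty string or None if missing).
    credit_sources: iterable of (person_id, source_url) pairs.
    """
    result = set()
    for person_id, url in credit_sources:
        if person_id not in persons or persons[person_id]:
            continue  # unknown person, or Kanji name already present
        host = urlparse(url).hostname or ""
        if host in whitelist:
            result.add(person_id)
    return sorted(result)
```

The output is simply a to-do list of person ids for which a Kanji source is probably one click away.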
The more information about a person you have, the more consistency checks you can perform. That's why a higher percentage of Kanji names would be beneficial for data quality while being almost irrelevant for normal users. (How much would it cost to display this percentage as a simple number somewhere? Script-based, of course. It might be nice to have a "water level indicator" of sorts where one can observe the progress of the Kanjification Crusade. And of course people without any credits - former duplicates that remained for technical reasons - should not be counted for this percentage, i.e. an encyclopedist fixing a duplicate persons issue would increase the Kanji percentage value in the process.)
dormcat
Posted: Thu Mar 26, 2009 4:44 am
Devil Doll wrote:
The ANN software could do that by simply trying all available split positions and checking the results; that would certainly be better than, say, selecting the first entry of the list of names for these Kanji from EDICT, and it would allow Kanji-only sources to be reasonably evaluated for anime where Roumaji sources are rare or non-existent.

You mean, having a database like Neko no Namae, Hito no Namae? Of course I'd like to have one, but it would be very impractical.

Devil Doll wrote:
And remember that it would still be a suggestion only, much like you give a warning for different Roumaji writings of the potentially same person.

I'm afraid I have to say that, based on my experience over the past five years working on the Encyclopedia, the average intelligence level of anime fans is LOWER than that of contributors to general user-contributed databases, e.g. Wikipedia. People tend to ignore warnings, let alone fuzzy logic designed to help people tell similar names apart. Those who are smart enough would do the work themselves.

Devil Doll wrote:
My general impression is that the current GUI for contributing credits is unfriendly to Kanji input in a number of ways. Such as: Kanji names of persons not found in the database lead to this person being rejected

They are not rejected; it simply tells you the kanji name does not exist in the system. Yet.

Devil Doll wrote:
Roumaji names of persons not found in the database lead to this person being created (regardless of potential translation errors - so Roumaji are more trustworthy than Kanji?)

I'm afraid that the majority -- and I mean >90%, an absolute majority -- of users are not able to read kanji or use Japanese sources for submission. Rather, they depend on English sources (fansub and/or DVD credits in particular). Furthermore, this is an English website, thus main entries have to be in English; have you seen kanji entry titles in the English version of Wikipedia?

Devil Doll wrote:
while the contributor isn't even given a chance to submit the Kanji name as well.

Sure yes -- only AFTER the new entry has been created.

Quote:
Neither in the feedback message ("person ... has been created") nor on the Anime page where the credit was given can you even find a link to that person in order to edit him/her (as this person has one credit only at this stage), and the ANN Google search won't find the new person either because the index needs to be refreshed first... apparently the shortest way to enter Kanji for that new person's names is to start an error report (!) for the credit, as that's the page where a link to the person is always visible.

In fact I complained about the same problem many years ago, but Dan replied that if a person has only one task/role, it would not be necessary to access its entry via the anime page. The weird thing is that new companies can be accessed even if they have only one appearance.

Devil Doll wrote:
Despite the severe warning by dormcat to be very aware of what you're doing while creating new persons, the GUI willingly accepts any rubbish as Roumaji name (despite your plausibility checks)

I'm afraid that is a common pitfall of any user-contributed database.

Devil Doll wrote:
I might have missed a number of such duplicates while adding Kanji names for already existing persons because the only way of reliably finding them would have been to fake giving them another credit (which I didn't do as the credit I was inspecting was already there, and it would have been a lot of additional work).

The fastest way would be to use the search engine located at the upper right corner. Hit the downward arrowhead and click on encyclopedia.

Devil Doll wrote:
I would dream of the ANN database having Hiragana as an additional third name of Japanese persons, and automatically generating Roumaji names from these (so that the decision which Roumaji to be used at ANN could be changed within that conversion code instead of manually renaming 75000 persons).

You know what my dream is? Switching the display language of the ENTIRE WEBSITE with a single click. Isn't that more convenient?

Devil Doll wrote:
My wild guess is that it would work for 90% of the existing persons.

One word: no. Just take the name you provided: while 吉田麻子 is very likely spelled as "YOSHIDA, Asako," you can't rule out the possibility of "YOSHITA" and "Mako," for the family and the given name, respectively. This is just the tip of the iceberg; many, many examples are out there.

Devil Doll wrote:
I think these are only a few situations (certain "o"/"ou" situations and the "n" being used as vowel, exactly those you named and fewer than your current mapping between different Romanizations). For example, doing a search for "にち" as substring on EDICT tells me that while there are in fact a number of names that contain these Hiragana all of these begin with them. Which makes me believe that "nichi" should always be romanized to "n'ichi" if (and only if) it's not the beginning of a name. This might even be helpful for your already existing Roumaji-based similarity logic. (Sato/Satou is a more difficult issue that can only be handled with Kanji knowledge available, i. e. on a different level of consistency checking.)

Erm, ANN is not a language institute. It is fine to give a few brief hints and reminders, but I don't think it's necessary to invest so much fine-tuning just to warn users. For expert users they might be as annoying as UAP of Windows Vista.

A very big problem is that neither hiragana nor romaji is widely known for the names of secondary staff. Only press releases for international media would contain official romanized names.

Devil Doll wrote:
Just one more idea, as it fits into this discussion: There must already have been many contributors who were prevented from doing the best possible input (see GUI discussion above), as they had a Kanji source available (they entered it as source for their credit contribution!) but still didn't create the Kanji names for the corresponding person. But it's never too late to fix that: Let dormcat specify a list of domains of known good Kanji information (D2_STATION, Knight, Shirayuki, Japanese Wikipedia, jmdb, maybe even Allcinema that at least has Hiragana for many persons; perhaps create a sticky discussion thread for suggestions of those "known good sources"), and then compute a list of all persons with missing Kanji name but a given source for at least one of their credits that matches an entry of this domain whitelist.

Good idea, but not without problems: 1) while those sites are almost perfect, they still contain errors from time to time 2) they still don't have any hiragana for secondary staff; you still have to either guess or acquire it from somewhere else.

Devil Doll wrote:
while encyclopedists would focus on fixing the real problems.

Heck, I can't even keep my goal of adding / approving ten or more new titles a day.

Devil Doll wrote:
The more information about a person you have, the more consistency checks you can perform. That's why a higher percentage of Kanji names would be beneficial for data quality while being almost irrelevant for normal users.

In fact if the entire database is written in kanji then it would be easier on me... but that's another story.
Dan42
Posted: Thu Mar 26, 2009 8:50 am
dormcat wrote:
Quote:
Neither in the feedback message ("person ... has been created") nor on the Anime page where the credit was given can you even find a link to that person in order to edit him/her (as this person has one credit only at this stage), and the ANN Google search won't find the new person either because the index needs to be refreshed first... apparently the shortest way to enter Kanji for that new person's names is to start an error report (!) for the credit, as that's the page where a link to the person is always visible.

In fact I complained about the same problem many years ago, but Dan replied that if a person has only one task/role, it would not be necessary to access its entry via the anime page. The weird thing is that new companies can be accessed even if they have only one appearance.

Then rejoice! For at long last all names are now linked to the person's page (try hovering on the names that appear not linked)
dormcat
Posted: Thu Mar 26, 2009 9:02 am
Dan42 wrote:
Then rejoice! For at long last all names are now linked to the person's page (try hovering on the names that appear not linked)

Hooray!
Devil Doll



Posted: Thu Mar 26, 2009 1:01 pm
Actually I considered the different handling of people with only one credit a good idea (making the visitor aware of this information probably being of lower reliability), and would have been fine with getting that link only on the feedback page of the person's creation process, for that's the moment I want to enter the Kanji for them, as the crediting form doesn't have fields for entering both Roumaji and Kanji at the same time.
Getting the link on the anime page is only the second best solution as it requires me to 1. reload that page in order to get the link and 2. scroll down to that particular credit, thus causing unnecessary database access to hundreds of information fields as well as additional mouse activity.

dormcat wrote:
Those who are smart enough would do the work themselves.
Is that a reason to make life unnecessarily hard for them? Shouldn't those who can (and want to) use them be offered any additional tools that can be provided with a reasonable cost-to-effect ratio? Those who are smart enough to do the work themselves are the ones aware of how the work could be done more efficiently... given the right tools.

dormcat wrote:
Devil Doll wrote:
My general impression is that the current GUI for contributing credits is unfriendly to Kanji input in a number of ways. Such as: Kanji names of persons not found in the database lead to this person being rejected
They are not rejected; it simply tells you the kanji name does not exist in the system. Yet.
I can create persons with Kanji names only? Look at this screenshot:

I don't see a way to create those persons without giving them a Roumaji name first.

dormcat wrote:
Devil Doll wrote:
Roumaji names of persons not found in the database lead this person being created (regardless of potential translation errors - so Roumaji are more trustworthy than Kanji?)

I'm afraid that the majority -- and I mean >90%, absolute majority -- of users are not capable of reading kanji or utilizing Japanese sources for submission. Rather, they depend on English sources (fansub and/or DVD credits in particular). Furthermore, this is an English website, thus main entries have to be in English; have you seen kanji entry titles in the English version of Wikipedia?
First of all, 90% of the users may not be able to read Kanji, okay. But are these the contributors? Isn't it more likely to find contributors amongst the other 10%?
And yes, the better articles of English Wikipedia use more and more Kanji these days (see Hiromi Tsuru for a really random example).

dormcat wrote:
Devil Doll wrote:
I might have missed a number of such duplicates while adding Kanji names for already existing persons because the only way of reliably finding them would have been to fake giving them another credit (which I didn't do as the credit I was inspecting was already there, and it would have been a lot of additional work).
The fastest way would be using the search engine located at upper right corner. Hit the downward arrowhead and click on encyclopedia.
I don't trust the ANN Google search for this function. (That was the very reason to start this thread!) It doesn't work on the same data set, and it frequently doesn't find people with Kanji names that the crediting dialog would find. Believe me, there's a reason why I suggest using the duplicate finding routine of the crediting dialog for the edit-person-kanjiname function. It's not just the number of mouse clicks.

dormcat wrote:
Devil Doll wrote:
I would dream of the ANN database having Hiragana as an additional third name of Japanese persons, and automatically generating Roumaji names from these (so that the decision which Roumaji to be used at ANN could be changed within that conversion code instead of manually renaming 75000 persons).
You know what my dream is? Switching the display language of the ENTIRE WEBSITE with a single click. Isn't that more convenient?
No, and it isn't even the discussion topic. I am not interested in having a German GUI for ANN. (I strongly discouraged the server admin of that German database who created an English output language feature exactly like you dream of it, because what will this feature give you as long as all the content - summaries, character descriptions, reviews, forum threads - still remains in German?)
We're talking about database content quality, not about automatic Babelfishing whole sites into any number of languages. That's not a dream (which would be worth gambarizing), that's an LSD trip. Hiragana as internal romanization normalisation language (and only for this purpose!) could be implemented in a matter of weeks, along with the set of tools making use of this concept.
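As a rough illustration of the idea above (Roumaji generated from a stored Hiragana reading, with the romanization style decided in one place), here is a minimal Python sketch. The table `KANA`, the function `to_romaji` and the flag `keep_long_u` are made up for this example and have nothing to do with ANN's actual code. The table only covers the few syllables used here, and treating every う after an o-sound as a long vowel is a simplification.

```python
# Partial kana table (assumption: only the syllables needed for the examples).
KANA = {
    "あ": "a", "き": "ki", "ら": "ra", "さ": "sa", "と": "to", "う": "u",
    "よ": "yo", "し": "shi", "だ": "da", "こ": "ko",
}


def to_romaji(hiragana: str, keep_long_u: bool = True) -> str:
    """Convert a hiragana reading to lowercase romaji.

    keep_long_u=True  keeps the long vowel spelled out ("satou").
    keep_long_u=False drops it ("sato").
    """
    out = []
    for ch in hiragana:
        syllable = KANA[ch]  # raises KeyError for kana missing from the partial table
        # Simplification: "う" right after an o-sound is treated as a long-vowel marker.
        if ch == "う" and out and out[-1].endswith("o") and not keep_long_u:
            continue
        out.append(syllable)
    return "".join(out)


print(to_romaji("さとう"))                     # long vowel kept
print(to_romaji("さとう", keep_long_u=False))  # long vowel dropped
```

Under this kind of model, changing the preferred spelling would mean changing one default in the conversion code rather than renaming persons by hand.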

dormcat wrote:
Devil Doll wrote:
My wild guess is that it would work for 90% of the existing persons.
One word: no. Just use the name you provided: while 吉田麻子 is very likely spelled as "YOSHIDA, Asako," you can't rule out the possibility of "YOSHITA" and "Mako," for the family and the given name, respectively. This is just a tip of an iceberg; many, many examples are out there.
I didn't say for 90% of the names, only for 90% of the persons. (By the way, that's why I asked whether encyclopedists have a feature for tagging obscurely looking information as "this is not wrong".)

dormcat wrote:
Erm, ANN is not a language institute. Anime smallmouth + sweatdrop It is fine to give a few brief hints and reminders, but I don't think it's necessary to invest so much fine-tunings just to warn users. For expert users they might be as annoying as UAP of Windows Vista.
ANN is a database working with the premise commit-then-review. And you are the reviewer. Should you not be interested to give the contributors any tools at hand to reduce your own workload of fixing errors? And if an additional feature might be considered annoying then make it conditional, based on a setting in the configuration of each ANN user. There's a solution for all your objections.

dormcat wrote:
A very big problem is that either hiragana or romaji is not widely known for names of secondary staffs. Only press releases for international media would contain official romanized names.
True - for English media. That's exactly where D2_STATION and the like are better, most notably for staff (as most cast persons have their own Japanese Wikipedia entry with both Kanji name and Hiragana reading). I'm all with you in naming the problem; I'm just suggesting to use better sources, and to make life easier for those contributors who are able to use these sources. And if an average staff person has 10 credits, then let 10 Americans enter those credits from the U.S. DVDs and one of the 10% able to read Kanji add the Kanji names, to get an additional concept of finding duplicate kanji names, and very likely duplicate persons in the process.
Given the task translation table that's currently being filled, and the multi-input mode, we're actually not far away from going to a page like this one, selecting the content with a mouse, and cut&paste it into ANN. Can you see how much that would speed up the contribution process, while at the same time avoiding any romanization issues? It all depends on the matching rate of the Kanji names for persons.

dormcat wrote:
Good idea, but not without problems: 1) while those sites are almost perfect, they still contain errors from time to time 2) they still don't have any hiragana for secondary staff; you still have to either guess or acquire it from somewhere else.
Yes, they contain errors (I found some even at D2_STATION). But is the average error frequency higher or lower than using the DVD cover credits of U.S. anime releases that translated the names to Roumaji?
And for the second issue: Yes, that's a problem - as long as too few ANN person entries have Kanji names. The more Kanji names, the higher the matching rate. Everything gets better with a higher percentage of Kanji names. And that's exactly what I did for The Yamadas: I took several dozens of already existing persons with Roumaji names only and no Kanji names, took a good Kanji source, and manually matched them to fill in the missing names. The more of these tasks are performed, the easier it will become to contribute Kanji-only sources. Actually the matching rate is surprisingly high for Suzuka (about 80%, as most staff people have several other credits already and at least one of them caused them to get a Kanji name) while it was rather low for The Yamadas (below 50%, as those credits were based on an English-only source).

dormcat wrote:
Devil Doll wrote:
while encyclopedists would focus on fixing the real problems.
Heck, I can't even keep my goal of adding / approving ten or more new titles a day. Anime dazed
And that's exactly why I want contributors not to create hundreds of duplicate persons (that you will have to fix as well), by giving them better consistency checks during the contribution process already. Your time is more precious than mine, as you're the encyclopedist and I'm the contributor.

Devil Doll wrote:
The more information about a person you have, the more consistency checks you can perform. That's why a higher percentage of Kanji names would be beneficial for data quality while being almost irrelevant for normal users.
In fact if the entire database is written in kanji then it would be easier on me... but that's another story.
If the entire database were written in Kanji only then how would we know how to pronounce all those names? Kanji plus Hiragana (and optionally auto-generated Roumaji from those Hiragana, for the English only users), that would be it... for me.
(You now see why I was asking about primary key structures and the like... in order to not ask for the impossible. Dan can't and won't reinvent the wheel from scratch, that's why I have to understand the existing model as well as possible to become able to suggest easy to implement improvements, as his time surely is one of the scarcest resources here.)
dormcat
Encyclopedia Editor


Joined: 08 Dec 2003
Posts: 9902
Location: New Taipei City, Taiwan, ROC
Posted: Thu Mar 26, 2009 3:50 pm
Devil Doll wrote:
I don't see a way to create those persons without giving them a Roumaji name first.

Okay, I see what you mean. However, I still see no point in entering all those names in kanji while we don't yet have any trustworthy way to transliterate them into romaji. Remember that the primary language of this site is English and Latin letters; kanji is the supplementary information. You are welcome to help me enter info should there be an ANN.tw that uses kanji first and romaji second. Wink

Devil Doll wrote:
And yes, the better articles of English Wikipedia use more and more Kanji these days (see Hiromi Tsuru for a really random example).

Huh? The only kanji I saw was her name. What's your point?

Devil Doll wrote:
I don't trust the ANN Google search for this function. (That was the very reason to start this thread!) It doesn't work on the same data set, and it frequently doesn't find people with Kanji names that the crediting dialog would find.

I'm VERY disappoint of you now. First, Dan just unlocked the feature so even names with just one cast or staff credit can be accessed through an anime/manga page, yet you seem to have ignored his announcement. Second, you have enough time and energy to type all these long posts, yet you haven't even tried to use the three different search modes of the upper right button: the embedded Google search is used only when you click on "search" directly; if you click on the downward arrow, you can use the other two modes, which are different and have nothing to do with Google at all.

Devil Doll wrote:
ANN is a database working with the premise commit-then-review. And you are the reviewer. Should you not be interested to give the contributors any tools at hand to reduce your own workload of fixing errors?

Actually, no. Surprise, surprise! Laughing In fact I prefer another system of quality control, but I'd rather not discuss it here.

Devil Doll wrote:
And if an additional feature might be considered annoying then make it conditional, based on a setting in the configuration of each ANN user. There's a solution for all your objections.

"Conditional?" Do you mean "optional?" I'm not sure what your "based on a setting in the configuration of each ANN user" means; that a user can choose whether to use the hiragana submission mode or not?

Devil Doll wrote:
True - for English media. That's exactly where D2_STATION and the like are better, most notably for staff (as most cast persons have their own Japanese Wikipedia entry with both Kanji name and Hiragana reading).

I think I'm starting to understand what you're really into: instead of automated transliteration, you'd rather have the entire database based on kanji/hiragana instead of romaji. Am I interpreting correctly?

Devil Doll wrote:
And if an average staff person has 10 credits

I know you're just making an example, but I don't know whether I should laugh or cry.

Devil Doll wrote:
Given the task translation table that's currently being filled, and the multi-input mode, we're actually not far away from going to a page like this one, selecting the content with a mouse, and cut&paste it into ANN. Can you see how much that would speed up the contribution process, while at the same time avoiding any romanization issues? It all depends on the matching rate of the Kanji names for persons.

Have you performed any experiment, or did you just imagine those "improvements" with no proof whatsoever? I took 平川亜喜雄 as a test subject, as I haven't heard of this animation director. Based on my knowledge I made an educated guess that the furigana of his name should be ひらかわあきお, so I searched for it together with his kanji name using Google. Guess what? Nothing came out. Then I was forced to use "Akio Hirakawa" instead (still with kanji), and five results popped out. Not surprisingly, ANN was the #1. Funny enough, if I used katakana ヒラカワアキオ then four results came out, yet none of them actually has the katakana; Google either translated it into romaji or simply ignored it.

Devil Doll wrote:
Actually the matching rate is surprisingly high for Suzuka (about 80%, as most staff people have several other credits already and at least one of them caused them to get a Kanji name) while it was rather low for The Yamadas (below 50%, as those credits were based on an English-only source).

Because there are much fewer people on the credit roll of Suzuka, probably less than 1/5 of those on any Ghibli movie. It would be much easier to obtain the complete staff list for a Ghibli movie: just watch the ending. TV anime usually don't list everyone; sometimes they do so at the end of the entire series, but not every finale has enough time for that.

Devil Doll wrote:
And that's exactly why I want contributors not to create hundreds of duplicate persons (that you will have to fix as well)

Why should I? Razz While it's unlikely I'd quit working on this database, I'm not bound by a contract or whatnot, thus I can work with my own pace and style.

Devil Doll wrote:
Your time is more precious than mine, as you're the encyclopedist and I'm the contributor.

The more I read your posts, the more I feel that you're simply trying to persuade us to remodel the Encyclopedia so you can submit as much info in the shortest time. Sounds like a win-win situation right? Not quite so, as you don't have to invest your time on technical and bibliographical researches and investigations.
Dan42
Chief Encyclopedist


Joined: 02 Jan 2002
Posts: 3794
Location: Montreal
Posted: Thu Mar 26, 2009 7:19 pm
Yeah, it seems like a link to the person's profile would be both useful and easy to do on the info-was-added page, so I'll add that right away.

You are correct that people cannot be created with Kanji names only, and that's how it should be and not about to change. 99% of the people reading this website cannot read Kanji names, and so all Japanese names MUST be created with romaji, by necessity. Making life easier on the contributors is good, but not if it makes life harder on everyone else. I would not consider an encyclopedia chock-full of kanji to be good or useful from the perspective of an English reader.

Honestly I don't think I understand what your point is. You seem very enthusiastic about this hiragana normalisation model, but I don't see what advantage it would provide. And please don't just say "it's better" without quantification; that's meaningless. Also, auto-converting kanji to hiragana/romaji (even if only as "just a suggestion") will definitely speed up input from Japanese sources, but I fail to see how it would reduce duplicates. Please give me concrete examples to illustrate your theories.
Devil Doll



Joined: 07 Jul 2007
Posts: 656
Location: Germany
Posted: Thu Mar 26, 2009 8:36 pm
Dan42 wrote:
I fail to see how it would reduce duplicates. Please give me concrete examples to illustrate your theories.
I believe that the duplicate detection would work a lot more reliably based on both Kanji and Hiragana names than it can ever work based on Roumaji names, because 1. the ANN users are not using one unique Romanization method and 2. Roumaji names can be subject to both a) Kanji reading errors and b) spelling errors. 2a) is impossible for Kanji names, and as for 2b), quite a few spelling errors end up leading to Roumaji names of Japanese people (and they're still the main part here, right?) that can't even be written in Hiragana (such as most cases of two consecutive consonants, or a name ending with a consonant other than "n").
The probability of "Akira Sato" and "Akira Satou" being duplicates is much lower (because both "Akira" and "Sato[u]" can be Kanji reading errors or random romanizations) than when you have the Hiragana (to tell apart Sato from Satou) or even the Kanji (to tell apart Akira from Akira when both may be one of 20 possible readings).
And the sooner a duplicate can be identified the sooner can it be removed before Roumaji-only users are faced with the decision to randomly assign their new credits to one of these persons (which your similarity logic might even show them but they're unable to handle the situation).
I've faced this decision several dozen times in the last few days and often didn't know what to do. Should I search for an already existing person in the database whose Roumaji name would be a possible reading for the Kanjis I have and who was already credited for the same tasks that my new cast has? And should I enter the Kanji for that person when they're not yet in the database? I tried to avoid this situation whenever I could (i. e. I first checked the sources of the person in question, hoping for a Kanji-only source whose contributor simply forgot to add the person's name in Kanji as well - and if I found such a source then I did what that previous contributor would have done had the GUI not kept him/her from doing so). But in many cases none of the existing credits had any usable source. What should I do now? Ignore my kanji source credits? Or bet that I found the correct person in ANN?
As a side effect, this operation led to several dozen reports of potential duplicate persons in cases when I could prove that more than one person in the ANN database had identical Kanji names based on at least one source (because I found those names based on already existing sources) and compatible tasks being credited for and compatible Roumaji readings. I would rather not want to report duplicates based on Roumaji names plus tasks alone (but I did that as well in a few cases).
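To make the point about Roumaji spellings that "can't even be written in Hiragana" a bit more concrete, here is a small Python sketch of such a consistency check. It is only an assumption about how a check like this could look; the names `writable_in_kana`, `_ONSET`, `_MORA` and `_NAME_PART` are invented for the example, and the syllable grammar is deliberately loose (it ignores spellings like "tch" or "mb", and it would also reject some spellings that exist in practice, such as a long vowel written with a trailing "h").

```python
import re

# Loose, Hepburn-like syllable grammar (an assumption for this sketch):
# an optional onset followed by a vowel, or a bare "n" not followed by a vowel or "y".
_ONSET = r"(?:ky|gy|sh|ch|ny|hy|by|py|my|ry|ts|[kgsztdnhbpmyrwjf])"
_MORA = rf"(?:{_ONSET}?[aeiou]|n(?![aeiouy]))"
# A doubled consonant (small tsu), e.g. the first "t" in "hattori", may precede a syllable.
_NAME_PART = re.compile(rf"(?:(?:([kstp])(?=\1))?{_MORA})+")


def writable_in_kana(romaji: str) -> bool:
    """Return True if every space-separated part of the name parses as kana syllables."""
    parts = romaji.lower().split()
    return bool(parts) and all(_NAME_PART.fullmatch(p) for p in parts)


for name in ["Akira Satou", "Hattori", "Jun", "Akirs", "Sakto"]:
    print(name, writable_in_kana(name))
```

A check like this cannot tell whether a reading is correct, only that a spelling is impossible; the last two names fail because of a trailing consonant other than "n" and two different consecutive consonants.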

Dan42 wrote:
You are correct that people cannot be created with Kanji names only, and that's how it should be and not about to change. 99% of the people reading this website cannot read Kanji names, and so all Japanese names MUST be created with romaji, by necessity.
You're right that persons with Kanji-only names must not be visible to the majority of ANN users.
But is it really impossible to store them? Couldn't they (plus the corresponding credits) be contributed somehow, and only activated once the corresponding person has been created the normal way? I know that concept from another anime-related database where they use that method for other data (video file attributes); they're stored under some unique key (more unique than a Kanji name) and activated once the object they belong to happens to be created.
But I am not telling you to implement that; it wouldn't be worth the effort unless the percentage of Kanji names in ANN would rise first.
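Just to show what "store now, activate later" would mean in practice, here is a tiny Python sketch. Everything in it is hypothetical: the class `PendingCredits`, its methods and the use of the Kanji name as the lookup key are assumptions for illustration (as said above, a real system would want a key more unique than a Kanji name).

```python
from collections import defaultdict


class PendingCredits:
    """Holds credits for Kanji-only names until the person is created normally."""

    def __init__(self):
        self._pending = defaultdict(list)  # kanji name -> [(title, task), ...]
        self.active = []                   # visible credits: (person_id, title, task)

    def submit(self, kanji_name, title, task):
        # Stored, but not visible anywhere on the site yet.
        self._pending[kanji_name].append((title, task))

    def activate(self, kanji_name, person_id):
        # Called once a person with a Roumaji name and this Kanji name exists.
        credits = self._pending.pop(kanji_name, [])
        self.active.extend((person_id, title, task) for title, task in credits)
        return len(credits)


store = PendingCredits()
store.submit("吉田麻子", "Some Title", "cast")   # placeholder title
print(store.active)                              # still empty
print(store.activate("吉田麻子", person_id=1))   # number of credits activated
print(store.active)
```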

dormcat wrote:
I'm VERY disappoint of you now. First, Dan just unlocked the feature so even names with just one cast or staff credit can be accessed through an anime/manga page, yet you seemed to have ignored his announcement.
You're apparently not reading my postings. I even commented on that change (that it only was my second choice, compared to setting the link on the contribution feedback page which would spare the database from me reloading the whole anime page with hundreds of data fields).

dormcat wrote:
I think I'm starting to understand what you're really into: instead of automated transliteration, you'd rather have the entire database based on kanji/hiragana instead of romaji. Am I interpreting correctly?
I wish it were that easy. Actually it's like this: I am trying to understand as much of the existing database structure of ANN to find a way of getting as close as possible to what you want it to be like with as few changes as possible for Dan to implement. I'm trying to find things that are worth actually being implemented in the near future, instead of questioning the whole concept as it is. That's what I had to input 2000 data fields for - to be able to understand how the whole thing works and where its strengths and weaknesses lie; without that it would have been impossible to even make the slightest suggestion. But at least some ideas appear to have been worth thinking about (such as what contributors would want the GUI to be like if they have Kanji names to offer for the credits they are entering). I'm not a revolutionary.
And my primary focus is consistency checks because they could be done by software instead of real people who never have enough time for additional (and boring) tasks. I'm trying not to put any additional workload on any encyclopedist (with the exception of the programmer). On the contrary, I would welcome any check that would save the encyclopedists any amount of work. (Given the encyclopedists still believe they will ever be able to handle all incoming duplicate person reports. If you gave up this goal a long time ago then I won't be able to achieve anything significant.)

dormcat wrote:
Devil Doll wrote:
And if an average staff person has 10 credits
I know you're just making an example, but I don't know whether I should laugh or cry.
I'm not talking about current ANN data here. I'm rather talking about something closer to reality, i. e. a lot more staff data for more animes (based on more Kanji data as low-level animator data will rarely be available in Roumaji) and considerably fewer people (after purging many duplicates). 10 may still be optimistic, I know - it referred to the 90% of Kanji illiterate ANN users. If Kanji literate ANN users would contribute twice as much as the rest, then 5 credits on average would suffice.

dormcat wrote:
TV anime usually don't list everyone; sometimes they do so at the end of the entire series, but not every finale has enough time for that.
You're right, the TV series doesn't have enough time. But... where does D2_STATION then get lots of staff credits for Suzuka TV from that aren't visible in the TV anime? Had I not compared these two sources I wouldn't have found a number of duplicates I reported.

dormcat wrote:
Devil Doll wrote:
And that's exactly why I want contributors not to create hundreds of duplicate persons (that you will have to fix as well)
Why should I? Razz While it's unlikely I'd quit working on this database, I'm not bound by a contract or whatnot, thus I can work with my own pace and style.
Let me rephrase: "...that either you will have to fix or won't be fixed ever because no one except you has the skills and the authorization to fix them." (Making any postings about duplicate persons completely meaningless then.)

dormcat wrote:
The more I read your posts, the more I feel that you're simply trying to persuade us to remodel the Encyclopedia so you can submit as much info in the shortest time. Sounds like a win-win situation right? Not quite so, as you don't have to invest your time on technical and bibliographical researches and investigations.
Read my duplicate person reports. You can see how much I already contributed in technical and bibliographical researches myself. If that's not enough, then please teach me which additional information you need to handle a duplicate person with a minimum of effort, which sources you trust most and the like. You wrote a lot of very good postings about how to contribute data to ANN.
The funny thing is: Actually I don't want to contribute data to ANN. That's not exactly why I came here. I want to use the database first and foremost. I'm using it for identifying persons whom I only have Kanji data for. I have been doing that for a long time already. I wanted to pay back part of my dues by contributing data from good sources for persons who obviously have data from less reliable sources (The Yamadas, remember? I haven't even seen this anime, it just looked like an excellent example to compare English-only sources with a Kanji source). And the more I work with ANN, the more convinced I am that persons without Kanji names and one credit only tend to be Romanization guesses that aren't better than what I could guess with EDICT. Guessing with the ANN Google search and the correctly split Kanji (or faking a credit contribution) is already much better... given enough persons have a Kanji name. That's why I am contributing Kanji names: Because they make ANN as a whole more valuable.

You are free to simply ignore all that I'm writing here. But if some parts of my reasonings happen to be helpful for ANN as a whole, please don't ignore them because they came from me.


Last edited by Devil Doll on Thu Mar 26, 2009 9:36 pm; edited 1 time in total
Dan42
Chief Encyclopedist


Joined: 02 Jan 2002
Posts: 3794
Location: Montreal
Posted: Thu Mar 26, 2009 9:30 pm
Devil Doll wrote:
I believe that the duplicate detection would work a lot more reliably based on both Kanji and Hiragana names than it can ever work based on Roumaji names because the ANN users are not using one unique Romanization method. The probability of "Akira Sato" and "Akira Satou" being duplicates is much lower

You have stated and re-stated your beliefs several times but have yet to provide a convincing example. Instead of the Sato/Satou confusion we'd have the さと/さとう confusion. How does that make things better?

Devil Doll wrote:
You're right that persons with Kanji-only names must not be visible to the majority of ANN users.
But is it really impossible to store them? Couldn't they (plus the corresponding credits) be contributed somehow, and only activated once the corresponding person has been created the normal way?

Hum... sorry but I'm going to put my time on more urgently needed features.
All times are GMT - 5 Hours