Forum - View topic
Avoiding kanji name collisions for duplicate persons without




Anime News Network Forum Index -> Site-related -> Encyclopedia
Devil Doll



Joined: 07 Jul 2007
Posts: 656
Location: Germany
Posted: Tue Jun 15, 2010 8:14 pm
While contributing kanji names to persons I frequently find persons with the same kanji name as another person already existing in ANN. Occasionally (i.e. for many seiyuu and some higher staff tasks) one of these persons is an obvious duplicate and easy to tag as such. More often (i.e. for less-known seiyuu and most lower staff tasks) it is impossible to state which of the persons has the correct romanization, and due to this lack of information it's also useless to report them in the "Duplicate persons" thread, so ANN has to live with them for the time being.
These kanji name collisions for duplicate persons without a known romanized name are the topic of this posting.

When facing this kind of situation I have two obvious options:
  • Contribute the kanji name for each of these persons, as it certainly wouldn't be wrong information: each of these "persons" actually has this kanji name, as backed up by the given sources (most of which have already been cited when contributing the credit; I rarely have to rely on my own sources for adding kanji names). But this would create kanji name duplicates for each of them, slowing down the contribution process from kanji-only sources, i.e. the exact opposite of what I consider helpful.
  • Refrain from contributing the kanji name for all but one person, which would mean not contributing a kanji name that already exists at ANN. But this would mean withholding potentially helpful information from the ANN database (and even information from good sources that are easy to verify).
For a while I chose the first of these two options, rating the additional information higher than the contribution speed. But the more of these cases I find, the more I'm longing for a solution to this conflict.

Actually, with the given contribution mechanisms I can easily bypass this problem by entering the kanji name in a form that the contribution function can't match when translating kanji into romanized names. This way the kanji name would be readable for visitors (and have a verifiable source URL) but not cause kanji name collisions.

I have done so for a handful of persons now. The syntax I chose was to set the given name in round brackets (the contribution form allows this kind of input). Example:
Miyako OHTA = "太田 都" (4 credits for animation)
Miyako OOTA = "太田 (都)" (1 credit for animation)
Whoever contributes a credit for 太田都 will get that name translated to Miyako OHTA; whoever visits the Miyako OOTA page will see the kanji name of this person, perhaps just looking slightly awkward. So this would solve the problem without any of the disadvantages of the two other options.
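To illustrate why the brackets avoid the collision, here is a minimal sketch. It assumes the contribution function looks up kanji names by exact match (ignoring spacing); I don't know how ANN actually implements this, and the `translate` function and the tuple layout are made up for the example.

```python
def translate(kanji_name, persons):
    """Return all romanized names whose stored kanji name matches exactly.

    Assumption: matching is an exact string comparison after removing
    whitespace. The real contribution function may work differently.
    """
    key = "".join(kanji_name.split())
    return [romaji for romaji, kanji in persons if "".join(kanji.split()) == key]


# Without the brackets both persons match, so the contributor has to choose.
untagged = [("Miyako OHTA", "太田 都"), ("Miyako OOTA", "太田 都")]
collision = translate("太田都", untagged)

# With the given name in round brackets only one person matches.
tagged = [("Miyako OHTA", "太田 都"), ("Miyako OOTA", "太田 (都)")]
matches = translate("太田都", tagged)

print(collision)
print(matches)
```

The bracketed entry still displays its kanji on the person page; it merely drops out of the lookup.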

I'm fully aware that this field wasn't meant to take this kind of information (and I'll immediately stop using this technique when an encyclopedist tells me to do so, which is why I'm posting this text here in the forum). The number of persons I tagged this way is small: something like 3 so far IIRC, but I don't have a complete list of them. They could easily be found by Dan42 with an SQL statement searching for "(" in the kanji given name; an ANN Google search for "Given name (in kanji): (" does not work as it ignores the trailing "(" character.
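Such a search could look roughly like the following sketch. The table name `person` and its columns are pure assumptions for illustration (I don't know ANN's actual schema), and SQLite is used only so the example is self-contained.

```python
import sqlite3


def find_bracket_tagged(conn):
    """Return persons whose kanji given name contains an opening bracket."""
    return conn.execute(
        "SELECT romaji_name, kanji_family, kanji_given FROM person "
        "WHERE kanji_given LIKE '%(%' ORDER BY romaji_name"
    ).fetchall()


# Hypothetical schema and sample rows, not ANN's real data model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, romaji_name TEXT, "
    "kanji_family TEXT, kanji_given TEXT)"
)
conn.executemany(
    "INSERT INTO person (romaji_name, kanji_family, kanji_given) VALUES (?, ?, ?)",
    [
        ("Miyako OHTA", "太田", "都"),
        ("Miyako OOTA", "太田", "(都)"),
    ],
)

tagged_rows = find_bracket_tagged(conn)
print(tagged_rows)
```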

Comments of any kind to this situation would be more than welcome.

If I get an "okay" for this technique, I'd also like to start changing the kanji given names of persons for which I created kanji duplicates in the past (whenever I happen to find one of them; unfortunately I don't have a list of these either, so a simple report listing all kanji-duplicate persons at ANN would be a valuable tool for me).
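The kind of report I have in mind could be sketched like this. The input format (pairs of romanized name and kanji name) is an assumption, and the sample rows are only for illustration.

```python
from collections import defaultdict


def kanji_duplicates(persons):
    """Group romanized names by kanji name and keep groups with 2+ persons."""
    groups = defaultdict(list)
    for romaji, kanji in persons:
        # Ignore spacing between family and given name when comparing.
        key = "".join(kanji.split())
        groups[key].append(romaji)
    return {kanji: sorted(names) for kanji, names in groups.items() if len(names) > 1}


# Illustrative rows only; a real report would read from the database.
sample = [
    ("Miyako OHTA", "太田 都"),
    ("Miyako OOTA", "太田都"),
    ("Eiko KADONO", "角野 栄子"),
]

report = kanji_duplicates(sample)
for kanji, names in report.items():
    print(kanji, "->", ", ".join(names))
```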
Dan42
Chief Encyclopedist


Joined: 02 Jan 2002
Posts: 3791
Location: Montreal
Posted: Thu Jun 17, 2010 12:25 am
The example you chose above was probably not the best one, because Ohta/Oota is only a romanization difference. As stated before, this is the one and only case where merging names is the appropriate solution. Same reading, same kanji, same type of task... that's a clear duplicate.

The real problem is when two readings have the same kanji. Say, Emi Kadono and Emi Kakuno both have 角野 江美. If it's the same person then one of the readings is wrong. Which one? If you know which is incorrect, you shouldn't submit the kanji under that incorrect reading. But if you don't know, then how can you choose with which reading to use your little trick? If you choose to write "(角野) 江美" under Emi Kadono, then the name will always be auto-translated to Emi Kakuno and it's effectively the same as if you had chosen Emi Kakuno as the correct reading.

So basically, I don't see a case in which it makes sense to use this trick. Do you have a better example?
Devil Doll



Joined: 07 Jul 2007
Posts: 656
Location: Germany
Posted: Thu Jun 17, 2010 2:18 pm
Dan42 wrote:
The real problem is when two readings have the same kanji. Say, Emi Kadono and Emi Kakuno both have 角野 江美. If it's the same person then one of the readings is wrong. Which one? If you know which is incorrect, you shouldn't submit the kanji under that incorrect reading. But if you don't know, then how can you choose with which reading to use your little trick?
By likelihood. This is better than the current coin toss. You believe we have two cases (1. we know, 2. we don't); actually we have three cases (1. we know, 2. we don't and several options have a similar likelihood, 3. we don't and one option is clearly more likely than the others). Merging persons is difficult to undo so it will require case 1; but does this mean we have to treat cases 2 and 3 equally even though the "bracket tagging" is much easier to undo?
What's more, I already have to make this judgment for each and every kanji contribution to a kanji duplicate anyway. (And seeing the blue icons highlighting my previous contributions I "remember" my decision from back then.) The only thing that would change is that I wouldn't have to run through the same procedure over and over again.

There might be lots of criteria for choosing one of these names. For example: ANN has 3 角野 persons, the two you named plus Eiko KADONO 角野 栄子, the original story author of Kiki's Delivery Service (backed up by 3 news references). So while we don't have any solid proof for 角野 江美 being either Emi KADONO or Emi TSUNO (and thus can't make a merge), the evidence leans towards KADONO as the more likely candidate.

Another way would be: Look at the other persons named KADONO or TSUNO and choose the reading for which these kanji were used in a higher percentage of cases. ANN has 3 other KADONO (1 角野, 1 門野, 1 上遠野) and 9 other TSUNO (5 津野, 1 つの, 3 unknown). So ANN has three times as many TSUNO as KADONO, and yet not a single TSUNO with 角野 as kanji name; more than half of all TSUNO have the same kanji name 津野, and the ANN database doesn't contain a single known kanji alternative to it in practical use (as opposed to ENAMDICT). So again, my tendency would go towards KADONO, while this would certainly not be sufficient for a merge request. You may consider this way of judging the situation not that reliable, but I have to do something like this for each and every split request as well. And splits are much more difficult to undo than this "bracket tagging", yet you still go by the most likely name for these.
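As a small sketch of this criterion, using the counts quoted above: the dictionary layout and the function name are my own invention, and counting the persons with unknown kanji in the denominator is a design choice, not something ANN does.

```python
from fractions import Fraction

# Counts of the *other* persons per reading, as quoted above.
# None stands for "kanji unknown".
other_persons = {
    "KADONO": {"角野": 1, "門野": 1, "上遠野": 1},
    "TSUNO": {"津野": 5, "つの": 1, None: 3},
}


def kanji_share(counts, kanji):
    """Share of persons with this reading whose family name uses `kanji`."""
    total = sum(counts.values())
    if total == 0:
        return Fraction(0)
    return Fraction(counts.get(kanji, 0), total)


shares = {reading: kanji_share(counts, "角野") for reading, counts in other_persons.items()}
preferred = max(shares, key=shares.get)

for reading, share in shares.items():
    print(reading, share)
print("more likely reading:", preferred)
```

With so few persons per reading this is only a tendency, which is why I'd use it for bracket tagging but not for a merge request.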

A third one would be: One person has 1 credit, the other one has 19 credits. So amongst the translators who submitted those 20 credits, 95% were of the same opinion about the reading. Would you consider this enough for a merge if we talk about an in-between animator for whom we have no source URL with both kanji and romaji name? How much "majority opinion" would a merge require? 90%? 80%? 70%? I'd happily take a 70% tendency amongst 10 sources as an indicator for the bracket tagging (or "anonymous temporary merge" if you like) ... but would ask for more in case of a full merge.
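Expressed as a rule, this might look like the sketch below. The 70% and the minimum of 10 are the numbers from my paragraph above; treating each credit as one "source" and the function name are assumptions. No threshold for a full merge is included because I can't name one.

```python
def tagging_candidate(credits, min_percent=70, min_credits=10):
    """Return the reading to keep untagged, or None if there is no clear majority.

    `credits` maps each romanized reading to its number of credits.
    Assumption: every credit counts as one independent source.
    """
    total = sum(credits.values())
    if total < min_credits:
        return None
    reading, count = max(credits.items(), key=lambda item: item[1])
    # Integer comparison avoids floating point rounding at the threshold.
    if count * 100 >= min_percent * total:
        return reading
    return None


clear_case = tagging_candidate({"reading A": 19, "reading B": 1})
unclear_case = tagging_candidate({"reading A": 6, "reading B": 4})
too_few = tagging_candidate({"reading A": 5, "reading B": 1})
print(clear_case, unclear_case, too_few)
```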

A fourth one would be: Post a "bracket tagging required" thread here in the encyclopedia forum, with links to all involved persons and an explanation why one of the readings is considered the preferred one. This way it would be you setting the brackets, and only when the poster convinced you. It would of course require you to read those threads.

I mentioned this a number of times before but as it fits here, I'll repeat it: I believe ANN should make kanji name duplicates more visible. For example, the Emi KADONO page should have a link to Emi TSUNO (placed next to the new kanji name string of Emi KADONO, with a text such as "see also: Emi TSUNO"). This would be less than a merge (which apparently requires proof of the romanized name) but more than the current situation (as it directs a visitor of the Emi KADONO page to further credits that are probably the same person, regardless of the name). This would document exactly what ANN knows about these credits: they were given to 角野 江美, a person whose romanized name may be Emi KADONO or Emi TSUNO; which of the two, we don't know.
Pre-selecting one preferred romanization tends to distribute the credits unequally between both persons (note that it only influences kanji contributions, not contributions from English sources!) as opposed to distributing them randomly like the current situation. But what's the real difference? Neither mechanism is perfect, but mine makes life easier for (high-quality) kanji contribution. Plus, linking both kanji name duplicates together (like I described above) would even make the superset of "their" credits more visible to the audience, and as such be an improvement. (Not to mention that displaying the link to kanji duplicates would allow me to see the duplicate at the very moment when I enter the second instance of the kanji name, something for which I currently have no reliable tool, as the Google search function works with cached data and the ANN name search doesn't support kanji input.)

The reason why I'm discussing this matter at epic length is that I consider my "solution" not really elegant. I would prefer all 角野 江美 persons to have their kanji (with the links between them easily auto-generated from these) plus one of them being flagged as the preferred reading in case of doubt. Unfortunately this would require a change of your data model, which makes an implementation less likely as it would cost your precious time (lots of code to be checked for potential modification).
The contribution dialog would then need a small enhancement: When a kanji duplicate is detected, check whether one of the persons is flagged as the preferred reading, and if so, select this one as the only one for the kanji translation mechanism. This way everyone would be happy. (The possibility of more than one kanji duplicate having this flag can be ruled out in the dialog for setting the flag, which would first have to clear the flag for all persons of this kanji name before setting it again for the selected person.) Explicitly implementing such a flag would also allow modeling the situation of known real kanji duplicates (such as a 1960's seiyuu and a 2000's colorist), who would then not automatically link to each other and would be tagged as not being candidates for a merge request.
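A minimal sketch of the flag logic described above, assuming a simple in-memory list of persons; the `Person` class, the `preferred` field and both function names are hypothetical and not ANN's data model.

```python
from dataclasses import dataclass


@dataclass
class Person:
    romaji: str
    kanji: str
    preferred: bool = False  # hypothetical "preferred reading" flag


def set_preferred(persons, chosen):
    """Clear the flag for all persons sharing the kanji name, then set it once."""
    for person in persons:
        if person.kanji == chosen.kanji:
            person.preferred = False
    chosen.preferred = True


def candidates_for(persons, kanji):
    """Persons offered by the contribution dialog for a kanji name."""
    matches = [p for p in persons if p.kanji == kanji]
    flagged = [p for p in matches if p.preferred]
    # If one duplicate is flagged, it is the only translation offered.
    return flagged if flagged else matches


kadono = Person("Emi KADONO", "角野 江美")
tsuno = Person("Emi TSUNO", "角野 江美")
persons = [kadono, tsuno]

before = [p.romaji for p in candidates_for(persons, "角野 江美")]
set_preferred(persons, kadono)
after = [p.romaji for p in candidates_for(persons, "角野 江美")]
print(before, after)
```

Because the flag is cleared for the whole group before it is set, changing the preferred reading later is a single call and can never leave two flagged persons behind.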

[science fiction]There would be a way of implementing the whole thing without a change of data model: Specify an exact set of rules (similar to the ones I mentioned above), evaluate these on the fly whenever a kanji duplicate has been detected during a contribution process, and automatically select the "preferred reading" into the contribution form.[/science fiction] Actually not that different from the heuristic you're using for "name similarity"; its downside would be the inflexibility of the rule set and the additional load on the server.

And finally, there would be a different way of handling the whole issue: Check whether the current data model allows "swapping" the relation between merged persons. As of now, the data model apparently attaches person A as an alias of person B; would it be possible to semi-automatically "swap" such a merged couple so that B would be an alias of A? If that were the case then it would be no problem to simply merge Emi TSUNO and Emi KADONO, and "swap" their relation in case we get solid proof of her true name. ANN could then have a lot more merged persons, as a merge would be a less "final" decision; then again, there would be different qualities of "also known as".
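To show what I mean by "swapping", here is a sketch that assumes merges are stored as a plain alias-to-primary mapping. That mapping and the function name are guesses for illustration; the real data model may attach aliases quite differently.

```python
def swap_primary(alias_of, new_primary):
    """Make `new_primary` (currently an alias) the primary name of its group.

    `alias_of` maps each alias to its primary name. Raises KeyError if
    `new_primary` is not currently an alias.
    """
    old_primary = alias_of.pop(new_primary)
    # Re-point any other aliases of the old primary to the new one.
    for alias, primary in alias_of.items():
        if primary == old_primary:
            alias_of[alias] = new_primary
    # The old primary becomes an alias itself.
    alias_of[old_primary] = new_primary
    return alias_of


# Merged with KADONO as primary; later proof shows TSUNO is correct.
merged = {"Emi TSUNO": "Emi KADONO"}
swap_primary(merged, "Emi TSUNO")
print(merged)
```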
All times are GMT - 5 Hours