Utilika Foundation
4726 11th Avenue Northeast, Suite 512
Seattle, Washington 98105-4681, U.S.A.
http://utilika.org
Revised 6 April 2007
The Semantic Web vision requires authors to create unambiguous content, but whether authors are willing and able to do this even minimally has been questioned. To address this doubt, I conducted a pilot experiment. Subjects saw 25 sentences with scopally ambiguous adverbial quantifiers, such as "almost always". Subjects attempted to disambiguate each sentence by choosing between alternative representations of its possible meaning. Subjects were assigned to either or both of two tasks. In one task, each representation was a paraphrase. In the other, each representation was a situation description (a truth condition).
Subjects, on average, spent about 25 seconds per sentence, exhibited about 75% inter-task consistency, chose meanings by about 75% majorities, and became faster as they got practice. The paraphrasal task was enjoyed somewhat more and was performed faster than the truth-conditional task. The most dramatic influence on speed was presentation order: Subjects performing two tasks per sentence were much faster if the paraphrasal task was first on the page than if the truth-conditional task was first. A pair of tasks in the former order was almost as fast as a single truth-conditional task. Sentences differed substantially in how difficult it was for subjects to disambiguate them consistently and consensually. The most difficult sentences were those with near-universal quantification and multiple categorical arguments and complements. Subjects also showed apparent sensitivity to the order of verb arguments, being biased by whether a cause appeared before or after its effect. Subjects' comments called attention to subtle ambiguities in the instrument and exhibited a wide range of attitudes, from excited discovery to frustrated confusion, toward the tasks.
The results cast doubt on the doubt about humans as disambiguators. Both paraphrasal and truth-conditional disambiguation appear to be viable. They may also be efficiently combinable. Further research can explore task order, trainability and incentivizability for disambiguation tasks, and differences among individuals and sentences relevant to disambiguation task design.
In the envisioned Semantic Web (Berners-Lee et al. 2001), typical users would not spend much time reading Web pages. They would mostly rely on machines to do so for them. Machines would use the Web's content to answer users' questions and take autonomous actions compatible with users' interests.
If this kind of human-machine collaboration were attempted with the Web as it is, the efficiency might come at a price: The quality of the answers and decisions could be unsatisfactory. Ambiguity, vagueness, and underspecification pervade discourses in natural language, including typical texts on the Web. People often know enough to resolve semantic and pragmatic uncertainties in face-to-face communication (Wasow et al. 2005), but Web communication is more cue-impoverished, Web interlocutors tend to have less common ground, and machines are generally inferior to humans in semantic and pragmatic resolution. The Web is also a linguistically diverse medium, and empowering machines to discern intended meanings from text in hundreds of languages would be complicated by the fact that, although the world's languages can apparently express similar sets of meanings (Bittner 1995, Baker 1995, von Fintel and Iatridou 2006), they do so with significantly diverse mechanisms (Partee 1995, Partee 2004 p 220, Petronio 1995, Vieira 1995, Faltz 1995).
For an example of Web ambiguity, a health encyclopedia (United States National Library of Medicine and National Institutes of Health 2007) advises
(1) Avoid prolonged exposure to excessive heat and humidity.
A machine giving advice on the basis of this content might have trouble assigning values to vaguely quantitative modifiers like "prolonged" and "excessive". It might detect lexical and structural ambiguities associated with "and", but not reliably select the correct meaning. Consequently, some automated agents might tell their clients to avoid heat and also avoid humidity, while others tell their clients that dry heat and cold humidity are safe.
To compensate for the limitations of automatic disambiguation, the Semantic Web vision assumes that authors of Web content will benefit from making their works sufficiently unambiguous for machine reasoning and will agree to incur some limited cost to do so. An implicit further assumption is that progress in automated disambiguation (Blanchon 1997, Chantree 2003, Ceccato et al. 2004, Trujillo 1999 ch 9) will be slow enough to make some investment in human disambiguation rational for the foreseeable future.
If authors performed their own disambiguation, they would presumably want to know which disambiguation methods work best: which ones are easiest to use, take the least time to learn and use, and deliver the most useful results.
There are several kinds of disambiguation methods. The craft of effective speech and writing, whether general or technical (O'Conner 1996, Berry et al. 2003), is partly the craft of disambiguation. A more aggressive discipline is to write in controlled languages (Pool 2006), containing lists of ambiguity-preventing prohibitions. An extension of this practice accompanies it with interactive human-machine authorship, in which a machine performs real-time automated fault (including ambiguity) detection and interrogates the author (Bernth 2006, Sammer et al. 2006). In a more directed version of this technique, a machine guides the author, turning composition into selections among alternatives (Power 1999, Power and Evans 2004). In contrast with these uses of constrained natural languages, there are methods relying on formal languages. The earliest ones typically formalized only the lexicon, aiming to make it unambiguous (Maat 1999). Later projects typically changed the emphasis, targeting structural ambiguity (Pietroski 2007, Zalta 2005). The variety of formal-language authoring most associated with the Semantic Web Initiative addresses both lexical and structural ambiguity and uses natural-language-like interfaces to facilitate accessibility (Noy and McGuinness 2001, Fuchs et al. 2006, Schwitter 2005).
A frequent observation about all these methods, by those who teach them to users or analyze their use, is that people often fail--or refuse--to learn them and comply with them. This resistance may arise from the tendency of speakers and writers to be unaware of ambiguities in their own expressions or, if aware, to lack motivation to do anything about them (Arnold et al. 2004). Where there is motivation, the task of analyzing one's intended meanings and the implications of various representations of those meanings may be beyond persons without expertise in knowledge engineering (Marshall and Shipman 2003, Shirky 2003, Clark et al. 2005). Methods of disambiguation usually rely on literal representations and logical entailments, but these may be unintuitive, given the prevalence of metaphorical reasoning in human discourse (Lakoff 1992).
In light of these doubts, before investigating human efficacy in complex and demanding disambiguation tasks, it makes sense to verify that people can perform even minimal disambiguation tasks. A minimal task might have these properties:
Such minimal disambiguation tasks might arise if people were formulating content for the Semantic Web with the support of intelligent agents. Agents might question authors by giving them binary selection tasks. The interactions would be analogous to human-human disambiguation dialogues, as when a hearer asks "Wait, I didn't quite understand. Should I avoid heat and also avoid humidity, or avoid only hot, humid conditions?"
Two minimal disambiguation tasks are paraphrasal selection (choosing between less ambiguous paraphrases of the original sentence) and truth-conditional selection (choosing between two situations representing alternative meanings of the sentence).
For example, if a subject were shown the sentence
(6) Killer bees are migrating from Brazil to Canada.
a paraphrasal selection task might be to choose between
(7) Individual killer bees are migrating from Brazil to Canada.
(8) The killer-bee population is migrating from Brazil to Canada.
and a truth-conditional selection task might be to choose between
(9) This bee was born near Rio de Janeiro, and now it lives near Vancouver.
(10) Ten years ago there were no killer bees north of Brazil, and this year some have been found in Vancouver.
Suppose authors needed to specify explicitly the interpretation they have in mind. Would they know what it is? If so, could they select it from a pair of alternatives? If they could, would paraphrasal or truth-conditional selection work better? If they did not succeed at first, would they become successful with experience or training? Would they find the results worth the effort?
Paraphrasal selection and truth-conditional selection can be considered two different ways of representing a problem, and the way a problem is represented sometimes makes a major difference in success in reasoning tasks (Cosmides and Tooby 1996). But it is not obvious which representation would work better. Paraphrasal tasks might be easier because paraphrasing is a common activity and requires no reasoning or mapping between facts and generalizations; non-experts have been found generally successful in formulating synonymous paraphrases of sentences (Chklovski 2005). Or truth-conditional selection tasks might be easier because they tell stories, and people are skilled at using stories to illustrate or contradict truth claims.
I report here an experiment designed to provide some answers to these questions by asking subjects to perform repeatedly both these kinds of disambiguation on sentences exhibiting quantification ambiguity. Out of many types of ambiguity (Bach 1998), quantification ambiguity (Partee 1997) has long been a focus of semantic research and creates interpretive problems of several kinds. Some of its dimensions (with quantifiers in the examples enclosed in angle quotes) are:
This last subtype of quantification ambiguity was the focus of the experiment. The majority of the expressions presented to subjects included the adverbial quantifier "almost always", syntactically modifying the sentential verb. It can have alternative interpretations as a restriction on continuous time, on a set of events, and on the sets denoted by the verb's arguments. For example, in
(2) Dogs almost always chase cars.
"almost always" may be interpreted to describe "23 out of 24 hours a day", "9 out of every 10 times this could happen", "9 out of 10 dogs", or "9 out of 10 cars", as well as combinations of these. (I ignore other ambiguities, such as whether "dogs" and "cars" describe individuals or kinds, and which dogs and cars the relevant universe contains.) Thus, the sentence may mean that every dog chases almost all cars, almost all dogs chase all cars, or almost all dogs chase almost all cars, among other things. Facts about the world and the context of the sentence will affect the interpretation. Thus, grammatically parallel sentences, such as
(3) Students almost always order pizzas.
(4) Hammers almost always break windows.
(5) Rivers almost always carry pollutants.
may receive nonparallel interpretations.
I conducted the experiment as an exploratory pilot study, with a deliberately limited number of subjects. My main goals were to evaluate and improve the study instrument and to discover relationships meriting further investigation.
I created the experiment for administration on the Web. Subjects could use any Web browser to access and undergo the experiment, at a time of their choosing. The subject's browser needed to support forms and tables, but did not need to support frames, embedded scripting, animation, or applets, nor did subjects need to download any software. Subjects whose progress through the experiment was interrupted by a browser crash, a disconnection, or an inadvertent closing of the browser window could resume the experiment by reconnecting to the experiment's main page and providing a code that had been issued to them. I designed the experiment so that subjects could not change any action previously performed. Using the browser's "Back" button and resubmitting a form, for example, would elicit a page announcing that the wrong form had been submitted and asking the subject to resume the experiment (with the above-described resumption procedure). I used the Apache HTTP Server 2.0 and Perl 5.8 to create the experiment as a CGI application.
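The no-revision and resumption behavior described above can be illustrated with a small sketch. It is written in Python rather than the Perl of the actual application, and it is not a reconstruction of that application: the class `ExperimentSession`, the function `resume`, the status strings, and the format of the resumption code are all hypothetical. The idea is only that the server remembers which form it expects next from each subject and rejects any other form.

```python
import secrets


class ExperimentSession:
    """Tracks which page one subject is expected to submit next."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.next_page = 0                         # index of the form the server expects next
        self.resume_code = secrets.token_hex(4)    # hypothetical resumption code format
        self.answers = {}                          # page index -> submitted answer

    def submit(self, page, answer):
        """Accept a form only if it is the expected one; never overwrite earlier answers."""
        if page != self.next_page or page >= self.total_pages:
            return "wrong-form"                    # e.g. Back button followed by a resubmission
        self.answers[page] = answer
        self.next_page += 1
        return "done" if self.next_page == self.total_pages else "ok"


def resume(sessions, code):
    """Return the page at which the subject holding this code should continue, or None."""
    session = sessions.get(code)
    return None if session is None else session.next_page
```

Under these assumptions, a resubmitted earlier form leaves the stored answer untouched, and a subject who reconnects with a valid code is sent to the first unanswered page.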
The user interface during the main part of the study is illustrated in Figure 1.
![User Interface: Web form titled 'Question 1 of 25'. Content: Consider this sentence: Why do girls almost always have better handwriting than boys? Which of the following situations better illustrates what you think the writer meant? [ ] There were 1000 girls and 1000 boys in the school. Of the girls, 980 had handwriting scores above 50. Of the boys, 980 had handwriting scores below 50. [ ] There were 40 classes of pupils in the school. In 38 of them, the girls' average handwriting scores were higher than the boys', and in 2 of them the boys' were higher than the girls'. Which of the following sentences better expresses what you think the writer meant? [ ] Why does almost every girl have better handwriting than almost every boy? [ ] Why is it that, in almost all groups of young persons, the girls' average handwriting is better than the boys'?](ui.png)
Figure 1. User Interface
Testing and inspection of the application are possible at http://utilika.org/re/aa/test.html. Copies of the software and data are available on request.
A total of 47 subjects completed the study, and another 23 subjects quit the study without completing it. I had publicized the experiment and stated a sole requirement that subjects be able to read and write English. I was unable to enforce this requirement or verify the truth of the subjects' answers to questions about their characteristics. Subjects were anonymous. Thus, some subjects may have been robots, the same person participating multiple times, or people randomly guessing without reading the instructions. Inspection of the data provides little basis for such suspicions, however. No completing subject's completion time was unrealistically fast; only 2 pairs of completing subjects connected from the same IP address; in only 4 cases was there a pair of completing and aborting subjects connecting from the same IP address; no subject submitted more than a small fraction of the pages without selecting an answer; and no subject's answers were identical over any large sequences of pages. Nonetheless, there is no basis for considering the subjects a random sample of any known population.
With the subject-provided data taken at face value, the subjects had the characteristics described in Tables 1-3.
Table 1. Subjects by Native Language
| English a Native Language? | Relative Frequency |
|---|---|
| Yes | 74% |
| No | 26% |
| Count | 47 |
Table 2. Subjects by Best-Known English Variety
| Best-Known English Variety | Relative Frequency |
|---|---|
| United States | 81% |
| Other | 19% |
| Count | 47 |
Table 3. Subjects by Age
| Age Range (Years) | Relative Frequency |
|---|---|
| 0-19 | 4% |
| 20-39 | 64% |
| 40-59 | 26% |
| 60- | 6% |
| Count | 47 |
As part of the publicity and motivation for participation in the experiment, the invitation to participate stated that the first 20 subjects completing the experiment and registering their completion with the Web-based service-contracting service Amazon Mechanical Turk (Amazon.com 2007) would receive a nominal fee ($0.40) and be eligible for a bonus ($10.00) to be paid to whichever subject provided what I judged to be the most useful comments. Of the 47 completing subjects, 20 participated through Amazon Mechanical Turk.
I conducted the study during a 61-hour period, 10-13 March 2007.
Subjects read an introductory page describing the topic and my motivation, then a page describing the terms of participation, and then two orientation pages with an example of an ambiguous sentence and an explanation of the natural diversity of interpretations. They were told that they would be answering questions about the apparently intended meanings of sentences, and that in doing so they could use the sentences, their knowledge, their beliefs, and their experiences.
After the orientation, subjects received a succession of 25 sentences, each accompanied by 1 or 2 disambiguation tasks and, on some trials, a questionnaire about their feelings toward the study. After the last trial, subjects received a questionnaire about themselves and then an acknowledgement thanking them.
On each of the 25 trials, and as part of the questionnaire, subjects were invited to add any comments in a text area.
Of the 25 sentences, 16 contained "almost always" modifying the verb of the sentence, and 5 contained "nearly always" used in the same way. The other 4 sentences contained other verb-modifying quantifiers: "usually", "mostly", "almost never", and "only". With this mixture, I wanted to (1) provide a focus on "almost always" and an apparently synonymous "nearly always" so I could observe changes in subjects' behavior with experience, (2) preview responses to a few other adverbial quantifiers, and (3) avoid monotony.
My choice of a positive-polarity quantifier was based in part on practical impact. With negative-polarity universal and near-universal quantifiers, the alternative scope readings differ little or not at all in their truth conditions. For example, the sentence (Liu 1994)
(11) Use of marijuana almost always precedes other drug use.
interpreted with the quantifier having scope over the subject ("Almost all use of marijuana precedes other drug use") says that marijuana smokers are almost destined to use other drugs, while if the quantifier has scope over the complement ("Use of marijuana precedes almost all other drug use") it does not say that. By contrast,
(12) Coin collecting almost never precedes counterfeiting.
says that a coin collector is almost immunized against becoming a counterfeiter, regardless of whether "almost never" has subject or complement scope.
For each stimulus sentence, I created 2 paraphrases and 2 (numeric) truth conditions, with the pairs yoked, so that (in my judgment) each paraphrase had a corresponding truth condition, and each paraphrase/truth-condition pair was distinct in meaning from the other paraphrase/truth-condition pair (see Figure 1 for an example). I performed no external validation on my paraphrases, truth conditions, or pairings.
Each subject, on beginning the experiment, was randomly assigned to one of 4 treatment groups:
0. The task alternated every 5 trials, with paraphrasal selection in trials 1-5, 11-15, and 21-25, and truth-conditional selection in trials 6-10 and 16-20.
1. The task alternated every 5 trials, with truth-conditional selection in trials 1-5, 11-15, and 21-25, and paraphrasal selection in trials 6-10 and 16-20.
2. The task in every trial was paraphrasal selection and truth-conditional selection, in that order on the page.
3. The task in every trial was truth-conditional selection and paraphrasal selection, in that order on the page.
Thus, subjects in groups 0 and 1 performed 25 selections each, and those in groups 2 and 3 performed 50 selections each.
The order of the stimulus sentences across the 25 trials was randomized for each subject.
The 2 alternative responses for each task were displayed in a random order. This implies that subjects in groups 2 and 3 might see the corresponding paraphrasal and truth-conditional responses in the same position or in opposite positions within their pairs.
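The treatment design and the three randomizations just described (group, sentence order, and order of the 2 alternatives within each task) can be summarized in a short sketch. This is an illustration in Python, not the study's Perl code; the names `tasks_for_trial` and `build_schedule` and the record layout are hypothetical.

```python
import random

PARA, TRUTH = "paraphrasal", "truth-conditional"


def tasks_for_trial(group, trial):
    """Return the tasks shown on a trial (1-25), top to bottom, for treatment groups 0-3."""
    if group == 2:
        return [PARA, TRUTH]
    if group == 3:
        return [TRUTH, PARA]
    block = (trial - 1) // 5                  # 0..4: the task switches every 5 trials
    first, second = (PARA, TRUTH) if group == 0 else (TRUTH, PARA)
    return [first if block % 2 == 0 else second]


def build_schedule(sentence_ids, rng):
    """Randomly pick a group, shuffle the sentences, and shuffle each task's 2 alternatives."""
    group = rng.randrange(4)
    order = list(sentence_ids)
    rng.shuffle(order)
    schedule = []
    for trial, sentence in enumerate(order, start=1):
        # rng.sample([0, 1], 2) yields the display order of the task's 2 alternatives
        tasks = [(task, rng.sample([0, 1], 2)) for task in tasks_for_trial(group, trial)]
        schedule.append({"trial": trial, "sentence": sentence, "tasks": tasks})
    return group, schedule
```

Because each task's alternatives are shuffled independently, a subject in group 2 or 3 may see corresponding alternatives in the same or in opposite positions, as noted above.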
When performing paraphrasal selection, subjects were asked "Which of the following sentences better expresses what you think the writer meant?". When performing a truth-conditional selection they were asked "Which of the following situations better illustrates what you think the writer meant?".
The questionnaire about the subjects' feelings toward the study was administered on every 5th trial. For subjects in groups 0 and 1, this implies that each administration took place at the end of a block of 5 same-task trials. There was 1 question in this questionnaire, asking the subjects to describe their current feelings about the study on 3 dimensions, with 5 ordinal alternatives per dimension:
| Boring | o o o o o | Interesting |
| Hard | o o o o o | Easy |
| Useless | o o o o o | Useful |
I shall now describe my expectations and the actual results. Since this was a pilot study with a nonrepresentative convenience sample, I shall mostly avoid attaching statistical significances to the results. In light of the small counts, I also have ignored some potentially explanatory variables, such as the subjects' personal attributes (native versus non-native English etc.). Instances in which subjects did not respond are omitted. I had not committed the predictions described below to writing before the experiment, and even if I had done so the use of alternatives drafted by me permitted me to bias the outcomes, so caveat lector.
Although the paraphrasal task and the truth-conditional task both have advantages for user-friendliness, I expected the subjects on average to find the paraphrasal task easier, given the numeric reasoning required by the truth-conditional task. So, I predicted that the mean of the easy-difficult judgments of subjects in groups 0 and 1 at the end of each 5-trial paraphrasal batch would be substantially (at least 1 position) closer to the "easy" end than at the end of each truth-conditional batch. I made no prediction as to whether subjects in groups 2 and 3, those with both tasks to perform, would rate the study easier or more difficult than subjects in groups 0 and 1, those performing only 1 task. Having both tasks on the same page might help subjects gain confidence through bimodal reasoning; or it might worry them about internal consistency and make them feel more heavily worked.
The results point in the predicted direction, but the difference is only about one-third of the minimum predicted magnitude: a mean score of 2.6 versus 2.3 on a scale ranging from 0 (difficult) to 4 (easy). This and the related results for the other feeling dimensions and the other groups are shown in Table 4. The range of scores on no item exceeds 0.6 out of a possible 4.
Table 4. Feelings about the Study by Task
| Mean Feelings about the Study (0 = least, 4 = most), by Treatment Condition | Para | Truth | Para Truth | Truth Para | All |
|---|---|---|---|---|---|
| Interesting | 2.9 | 2.7 | 2.3 | 2.5 | 2.6 |
| Easy | 2.6 | 2.3 | 2.4 | 2.1 | 2.4 |
| Useful | 3.0 | 2.9 | 2.4 | 2.4 | 2.7 |
As subjects progressed through the experiment's 25 trials, I expected subjects to become more skilled and thus to consider their task(s) increasingly easy. I would not have been surprised by a 1-point (25% of the range) increase in the mean rating from the survey on trial 5 to the survey on trial 25.
Contrary to this prediction, there was a negligible change in the mean perceived ease over the course of the experiment, as seen in Figure 2. The trend line shown here and in subsequent figures is the linear regression line, minimizing the total squared deviation of the shown sample values.

Figure 2. Perceived Task Ease by Trial
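For readers who want to reproduce a trend line of this kind, the following Python sketch computes an ordinary least-squares line. The function name `least_squares_line` is hypothetical, and the ratings in the example are invented for illustration; they are not the study's data.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the summed squared vertical deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical mean ease ratings on the survey trials (not the study's data).
survey_trials = [5, 10, 15, 20, 25]
mean_ease = [2.4, 2.3, 2.5, 2.4, 2.4]
slope, intercept = least_squares_line(survey_trials, mean_ease)
```

A slope near zero, as in this invented example, corresponds to the negligible change in perceived ease reported above.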
We also have a behavioral measure of subjects' satisfaction: whether they completed the study or quit prematurely. If my hypothesis was well-founded, those performing truth-conditional selection would be more likely to quit than those performing paraphrasal selection. Suppose quitting is based on task(s) just performed or the impending task(s) just presented to the subject--what we can call the triggering task(s). Then quitters with a solely truth-conditional triggering task would be expected to be more numerous than quitters with a solely paraphrasal triggering task. This is true, but, as above, the difference is not interestingly great, as seen in Table 5. The overrepresentation of those with a dual triggering task is substantial and appears worth further attention. This could, of course, reflect a greater tendency for those who must perform 50 selections than for those who must perform 25 selections to decide that the study is too much work to justify completion.
Table 5. Quittings by Triggering Task(s)
| Trigger Type | Para | Truth | Para Truth | Truth Para |
|---|---|---|---|---|
| Latest | 0 | 2 | 4 | 6 |
| Impending | 0 | 2 | 6 | 8 |
Of the 23 who quit, 7 quit without having seen the first trial, so they were unaware of the task(s) they would have performed and had no triggering task. As indicated in Table 5, 4 more subjects had impending triggers than latest triggers: These subjects quit after seeing the first trial's task(s) but without completing that trial.
Subjects who performed both tasks on each trial may have performed them consistently or inconsistently. I considered their choices consistent if they chose the alternatives that I had classified as corresponding, i.e. representing the same decision about the sentence's meaning. Since the alternatives' orders within each task were randomized, subjects could not learn to apply any ordering rule for consistency. If they wished to be consistent, they needed to examine the texts and decide which truth condition corresponded with which paraphrase. The only treatment condition that might affect their performance was the order of the tasks. Half the subjects performing both tasks saw the paraphrasal task above the truth-conditional task, and the other half saw them in the opposite order. But both tasks would be visible together on most computer monitors, and subjects could perform and revise both of them, in either order, before submitting their decisions (see Figure 1), so I expected the order to have a minimal impact on performance. I predicted that subjects would tend to perform the first-position task first and that performing a paraphrasal task before a truth-conditional task would contribute marginally (about 10% of the range) to better performance, on the assumption that typical subjects could more easily retain in memory a paraphrase and apply that to analyzing a truth condition than vice-versa.
As shown in Table 6, the difference was about as small as I had predicted, but in the opposite direction. I had not made any prediction about absolute consistency levels, though, and the results show that subjects performing both tasks on each trial performed them consistently on 77% of all trials.
Table 6. Consistency by Task Presentation Order
| Trials of Subjects Performing Both Tasks Simultaneously, by Task Presentation Order | Para Truth | Truth Para | Total |
|---|---|---|---|
| Consistent | 73% | 81% | 77% |
| Inconsistent | 27% | 19% | 23% |
| Count | 271 | 214 | 485 |
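The consistency measure behind Table 6 can be expressed compactly. The Python sketch below is illustrative only: the record layout, the meaning labels ("P1", "T1", and so on), and the function names are hypothetical. It assumes that each choice has already been mapped from its randomized screen position back to a label, and that trials with a missing response are omitted.

```python
def is_consistent(para_choice, truth_choice, pairing):
    """True if the chosen truth condition is the one yoked to the chosen paraphrase."""
    return pairing[para_choice] == truth_choice


def consistency_rate(trials, pairing_by_sentence):
    """Fraction of answered 2-task trials whose two choices correspond."""
    answered = [t for t in trials if t["para"] is not None and t["truth"] is not None]
    hits = sum(is_consistent(t["para"], t["truth"], pairing_by_sentence[t["sentence"]])
               for t in answered)
    return hits / len(answered)


# Hypothetical records: alternatives are identified by meaning labels, not screen positions.
pairings = {"s1": {"P1": "T1", "P2": "T2"}}
trials = [
    {"sentence": "s1", "para": "P1", "truth": "T1"},   # consistent
    {"sentence": "s1", "para": "P2", "truth": "T1"},   # inconsistent
    {"sentence": "s1", "para": "P2", "truth": "T2"},   # consistent
    {"sentence": "s1", "para": None, "truth": "T2"},   # no response: omitted
]
rate = consistency_rate(trials, pairings)
```

In this invented example, 2 of the 3 answered trials are consistent.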
Whatever the consistency of subjects might be, I predicted that it would tend to increase during the course of the study. Subjects would learn to perform more consistently with experience, so the fraction making consistent choices would tend to increase from trial to trial. As Figure 3 shows, it did not: there was practically no change.

Figure 3. Consistency by Trial
The subjects' failure to become more consistent looks like a failure to improve with experience, but I collected no data on whether subjects were aware of consistency as a purpose and tried to achieve it. If they were pursuing that purpose, the results suggest that improvements require more than 25 iterations of a selection task, at least if there is no training or discussion about consistency.
Subjects performing 2 tasks per trial would be expected to work more slowly than subjects performing 1 task per trial. If truth-conditional selection is the more difficult task, subjects performing only that task would be expected to complete a trial more slowly than subjects performing only paraphrasal selection. I expected about a 50% greater trial duration for 2 tasks than 1, and among 1-task subjects about a 20% greater trial duration for truth-conditional selection than for paraphrasal selection.
The results, shown in Table 7, partly confirm these predictions. The 2-task trials were completed in a median of 36 seconds, 64% more slowly than the 22 seconds of the 1-task trials. And the median duration of a trial with only truth-conditional selection was 27 seconds, 42% more than the 19-second median duration of a trial with only paraphrasal selection. Both differences were in the predicted direction and stronger than predicted.
Table 7. Trial Duration by Treatment Condition
| Task(s) Performed in Trial | Para | Truth | All 1-Task | Para Truth | Truth Para | All 2-Task |
|---|---|---|---|---|---|---|
| Median Duration (seconds) | 19 | 27 | 22 | 29 | 48 | 36 |
| Mean Duration (seconds) | 29 | 41 | 35 | 38 | 67 | 51 |
| Trial Count | 340 | 335 | 675 | 275 | 225 | 500 |
I had no reason to expect 2-task trials to differ much in duration depending on the presentation order of the tasks, but they differed substantially. The median trial having truth-conditional selection above paraphrasal selection took 48 seconds, while the median trial with the opposite order took only 29 seconds, 60% as long, about the same as the median trial with only truth-conditional selection. A replication of this difference would suggest that truth-conditional selection is so greatly facilitated by prior paraphrasal selection that the time spent on paraphrasal selection is compensated by time saved in truth-conditional selection. It would also suggest that people generally solve problems in the order they appear on a page, even if they are free to solve them in any order they wish and even if it greatly speeds their work to reverse the order.
To get an intuitive idea of the likelihood of this last finding being replicated, suppose that the 20 subjects who were assigned 2 tasks per trial had not suffered depletion from withdrawals, and had been drawn from a population which, in both conditions, would have exhibited normal distributions and equal variances of trial durations. Then a difference (regardless of direction) at least as great as the observed 38-to-67-second difference in their means would occur in fewer than 5% of the samples (t = 2.14, DF = 18, N1 = 11, N2 = 9, M1 = 38, M2 = 67, s1² = 256, s2² = 1707). So, it would not be surprising to find this effect again in a larger sample. The assumptions required for the statistical test are not realistic: The lower bound on duration prevents normality, and in these samples the subjects who saw the truth-conditional task on top exhibited a much greater variance than the opposite-order subjects. Such assumption violations tend to have small effects, however (Hays 1973, pp. 409-410).
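The test statistic can be recomputed from the summary values quoted above. The Python sketch below uses the standard pooled-variance (equal-variance) two-sample t formula; the function name `pooled_t` is hypothetical. Because the quoted means and variances are rounded, the result is about 2.15 rather than exactly the reported 2.14.

```python
import math


def pooled_t(n1, mean1, var1, n2, mean2, var2):
    """Two-sample t statistic and degrees of freedom, assuming equal population variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / std_error, df


t, df = pooled_t(11, 38, 256, 9, 67, 1707)
CRITICAL_T_05_DF18 = 2.101          # two-tailed 5% critical value of t for 18 degrees of freedom
significant = abs(t) > CRITICAL_T_05_DF18
```

The statistic exceeds the two-tailed 5% critical value for 18 degrees of freedom, matching the "fewer than 5% of the samples" statement.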
As with consistency, I predicted that subjects would learn to solve the problems faster with experience. Unlike with consistency, the results agreed with this prediction, as Figure 4 shows. The median trial duration decreased from 33 seconds to 22 seconds over the 25 trials. Of the 47 subjects, 40 became faster and 7 became slower over time, with each individual's trend measured by applying the same regression method to that individual's sequence of trial durations. Statistically (according to the cumulative binomial distribution), this is a highly significant difference: Fewer than 1 in 10,000 samples of 47 drawn from a population half of which gets faster would contain 40 or more speed-increasing subjects.

Figure 4. Median Duration by Trial
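The binomial claim about the 40 speed-increasing subjects can be checked directly. The Python sketch below sums the upper tail of the binomial distribution exactly; the function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(n, k, p=0.5):
    """Probability of k or more successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 40 or more of 47 subjects getting faster, if each were equally likely to speed up or slow down
p_value = binomial_upper_tail(47, 40)
```

The resulting probability is far below 1 in 10,000, in line with the statement above.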
It would be reasonable to expect subjects who failed to complete the experiment to have been slower task performers than completing subjects, for two reasons. First, those who needed more time would find participation more costly. Second, quitters' trials were, on average, earlier in the experiment than completers' trials, and earlier trials tended to take longer. The results were mostly but not entirely consistent with this expectation. The half of the quitting subjects who performed 2 tasks with truth-conditional selection on top had a median task duration of 41 seconds, less than completing subjects' 48 seconds in this condition. But the other half of the quitting subjects had median task durations substantially longer than completing subjects: 33 (versus 19), 43 (versus 27), and 40 (versus 29) seconds, on paraphrasal, truth-conditional, and paraphrasal-above-truth-conditional trials, respectively. So, there is not a uniform pattern of those who need more time to perform the tasks being more likely to quit the experiment.
Satisfaction, consistency, and speed, in combination, suggest that subjects typically take the time they need in order to achieve an approximately uniform result under whatever conditions they work under. As the task becomes easier or more difficult, subjects compensate by working faster or more slowly, rather than by getting better or worse results. If this pattern were to be robust, it would pose a challenge. What could one do if one wanted to elevate the quality of the results, such as making those doing the work more satisfied and getting them to achieve greater consistency? Solutions affecting individual results could involve motivations, such as compensation, incentives, and goal redefinition. Solutions not affecting individual results could include applying multiple performers to each task (if the disambiguators were readers rather than authors). Solutions combining these approaches could include redefining the task as a game, with a goal of reaching agreement on meanings with other individuals (cf. von Ahn and Dabbish 2004).
In addition to inter-task consistency and task-completion speed, inter-subject agreement is another plausible indicator of success in disambiguation. Suppose that each sentence has some (objectively) most probable meaning. Then multiple people disambiguating the sentence, if completely successful, will all find that meaning and select it, producing complete agreement. If they are not completely successful, we could expect their agreement to increase with experience. I expected substantial (roughly 75%, halfway between complete agreement and maximal disagreement), and growing, agreement.
The magnitude of agreement was, as expected, about halfway between complete and minimal. Of 1646 subject-sentence-task combinations, subjects made the majority decision (i.e. the decision made by the larger fraction of the subjects in the same sentence-task combination) 1222 times, or in 74% of all cases.
The trend in agreement over the course of the trials, however, showed no evidence of the increase that I had predicted, as Figure 5 shows. There is even a small trend in the decreasing direction. The amount shown for any trial in this figure is the fraction of all choices made by subjects on that trial that were in agreement with the majority of the subjects who (on any trial) performed the same task on the same stimulus sentence. Subjects performing a 2-task trial thus contributed 2 choices to that trial's total. One sentence-task combination, out of the 50, produced a tie, and one of its tied alternatives was arbitrarily chosen as the majority alternative in this analysis.

Figure 5. Agreement by Trial
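The per-trial agreement measure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the study: the record layout `(trial, sentence, task, choice)`, the function name, and the toy data are all hypothetical. Ties are broken by whichever alternative was seen first, which is one arbitrary rule among many.

```python
from collections import Counter, defaultdict

def agreement_by_trial(records):
    """records: iterable of (trial, sentence, task, choice) tuples.
    Returns {trial: fraction of that trial's choices matching the majority}."""
    records = list(records)
    # Tally choices per sentence-task combination across all trials.
    tallies = defaultdict(Counter)
    for _trial, sentence, task, choice in records:
        tallies[(sentence, task)][choice] += 1
    # most_common(1) resolves ties by first-seen order (an arbitrary rule).
    majority = {key: counts.most_common(1)[0][0] for key, counts in tallies.items()}
    hits, totals = Counter(), Counter()
    for trial, sentence, task, choice in records:
        totals[trial] += 1
        hits[trial] += choice == majority[(sentence, task)]
    return {trial: hits[trial] / totals[trial] for trial in sorted(totals)}

# Hypothetical toy data; a 2-task trial contributes two records.
toy = [
    (1, 13, "para", 1), (1, 13, "truth", 1),
    (1, 22, "para", 2),
    (2, 13, "para", 1), (2, 22, "para", 1), (2, 13, "truth", 1),
    (3, 22, "para", 1), (3, 13, "truth", 2),
]
result = agreement_by_trial(toy)
print(result)
```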
Given the typical existence of a substantial majority favoring one of the two alternatives in a sentence-task combination, does one of the task types contribute a disproportionate fraction of the deviations from the majority? Those who are performing two tasks are in a more complex situation, which might induce more random-like choices. Or they may be more richly informed and empowered to make more reliable choices. I expected that any impacts would be sentence-specific and the aggregate impact would be insubstantial. Table 8 shows that there was no clear tendency for agreement with the sentence-task majority to be associated with a particular task type. Subjects performing only 1 task per trial were only slightly more often in the sentence-task majority than were 2-task subjects.
Table 8. Agreement by Treatment Condition
| Measure | Tasks Judged | Para Only | Truth Only | All 1-Task | Para above Truth | Truth above Para | All 2-Task | Total |
|---|---|---|---|---|---|---|---|---|
| Majority Choices (Percent) | Paraphrasal Tasks | 77% | N/A | 77% | 73% | 80% | 76% | 76% |
| | Truth-Conditional Tasks | N/A | 74% | 74% | 68% | 75% | 71% | 72% |
| | Total | 77% | 74% | 75% | 70% | 77% | 73% | 74% |
| Choices (Count) | Paraphrasal Tasks | 336 | 0 | 336 | 273 | 217 | 490 | 826 |
| | Truth-Conditional Tasks | 0 | 329 | 329 | 272 | 219 | 491 | 820 |
| | Total | 336 | 329 | 665 | 545 | 436 | 981 | 1646 |
If the character of a task exerts any systematic influence on the judgments made by those performing it, the influence should be revealed in the agreements between pairs of individuals. Two persons who evaluate a sentence with the same task should be more likely to agree on the sentence's meaning than two individuals who evaluate that sentence with different tasks. I expected some systematic biases arising from the task differences, even though I could not specify what they would be. So, I predicted that pairs of subjects performing the identical task on a sentence would agree more often than pairs performing the opposite tasks on it. I expected this influence to be small (about 10%). As Table 9 shows, my prediction was wrong: There was almost no difference. Subjects agreed with one another in about 67% of all cases, whether both did paraphrasal selection, both did truth-conditional selection, or one did one and one did the other.
Table 9. Agreement by Task Identity
| Pairs of 1-Task Subjects Judging Same Sentence | Identical Tasks: Para | Identical Tasks: Truth | Identical Tasks: Total | Opposite Tasks | Total |
|---|---|---|---|---|---|
| Agreeing | 68% | 66% | 67% | 66% | 67% |
| Disagreeing | 32% | 34% | 33% | 34% | 33% |
| Pair Count | 2158 | 2072 | 4230 | 4286 | 8516 |
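A pairwise comparison of the kind summarized in Table 9 could be computed along the following lines. This is only a sketch: the tuple layout, the function name `pairwise_agreement`, and the toy judgments are hypothetical, and it assumes, as the analysis does, that meaning 1 in one task corresponds to meaning 1 in the other.

```python
from itertools import combinations

def pairwise_agreement(judgments):
    """judgments: list of (subject, sentence, task, meaning) for 1-task subjects.
    Returns agreement rates for pairs with identical and with opposite tasks."""
    by_sentence = {}
    for _subject, sentence, task, meaning in judgments:
        by_sentence.setdefault(sentence, []).append((task, meaning))
    agree = {"identical": 0, "opposite": 0}
    total = {"identical": 0, "opposite": 0}
    # Compare every pair of subjects who judged the same sentence.
    for rows in by_sentence.values():
        for (task_a, meaning_a), (task_b, meaning_b) in combinations(rows, 2):
            kind = "identical" if task_a == task_b else "opposite"
            total[kind] += 1
            agree[kind] += meaning_a == meaning_b
    return {kind: agree[kind] / total[kind] for kind in total if total[kind]}

# Hypothetical toy data: four 1-task subjects judging one sentence.
toy = [
    ("s1", 1, "para", 1),
    ("s2", 1, "para", 1),
    ("s3", 1, "truth", 2),
    ("s4", 1, "truth", 1),
]
rates = pairwise_agreement(toy)
print(rates)
```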
Although the degree of agreement was insensitive to the task type and did not change much as subjects got more experience, one condition had a major impact on inter-subject agreement. This was the sentences themselves. They produced degrees of agreement almost entirely spanning the possible range, from a near-minimum 54% up to 91%, as shown in Figure 6. The computation here is the same as for Figure 5. Each task-type majority is based on all cases in which subjects judged the sentence with the task, whether that task was alone or combined with the opposite task on the same trial.

Figure 6. Agreement by Sentence
It would be unreasonable to expect a single variable to account for such variation. I surmised that, other than sampling variance, world knowledge rather than linguistic features would be responsible for most of it. But I also hypothesized that these dimensions would interact, in part because the quasi-universal quantifiers "almost always" and "nearly always" tend to produce some highly unrealistic readings, which subjects would usually reject. For example, most people would not expect anybody to utter the sentence
(13) Acquaintances almost always commit murders.
meaning that almost all acquaintances commit murders or that typical acquaintances commit murders continuously, so if either of those readings were offered as an alternative it would be rejected in favor of "Acquaintances commit almost all murders." Milder quantifiers, including "usually" and "mostly", each represented in 1 stimulus sentence, would, I predicted, make ambiguities more difficult to resolve and thus be found near the left end of Figure 6. In addition, I expected that sentences whose verbs had a categorical (generalized-quantifier-type) subject and a categorical complement or modifier, such as "dogs", would be more problematically ambiguous than other sentences, including those with referential subjects (such as "Fido"), because the former would permit more readings than the latter and thus make it easier for anybody formulating alternative paraphrases or truth conditions to make them both plausible. The sentences that escaped from both these tests of difficulty were sentences 1, 2, 6, 7, 9, and 10. On this basis, I predicted that they would be found on the right side of Figure 6. This prediction was almost confirmed: Of these 6 sentences, 5 are indeed on the right side, and 3 of those are the rightmost (i.e. most consensual) 3 sentences of all. Only sentence 6 is on the left side, and even it is near the middle.
Sentences differed substantially not only in the sizes of the majorities agreeing on their meanings, but also in the extent to which the selected meanings differed between task types. Figure 7 shows the fractions of subjects selecting meaning 1, depending on the task type. (As above, I am assuming that my meaning pairings were valid.) The sentences are ordered by the absolute arithmetic difference between these fractions. Of the 25 sentences, 2 (sentences 24 and 22) had opposing majorities in the opposite tasks.

Figure 7. Selected Meaning by Task Type
I had no general hypothesis about causes of inter-task differences, but it may be instructive to examine a sentence whose responses differed greatly by task. Sentence 22 was "Success usually comes to those who are too busy to be looking for it." Its paraphrasal alternatives were:
Of the 36 subjects performing this task on the sentence, 75% chose alternative 1, and of the 17 subjects performing only this task on it 88% chose alternative 1. The truth-conditional alternatives were:
By contrast, only 40% of the 30 subjects performing this task, and likewise 40% of the 10 subjects performing only this task on it, chose alternative 1.
Suppose we assume that the more likely meaning is the meaning in which the subject is the quantifier's scope, expressed by alternatives 1. It would then seem that the paraphrasal task captured appropriate intuitions, but the truth-conditional task defeated those intuitions. Why? Perhaps truth condition 2 grabbed an undue share of sympathy because it, alone, among the 4 formulations, refers to the cause before it refers to the effect. If this is correct, then good causal paraphrases or truth conditions for disambiguation may require parallelism in the order of appearance of cause and effect references, or persons who disambiguate may require training to avoid being excessively influenced by the order in which the references appear.
Some subjects commented on this sentence, and one of them reported finding the truth conditions ambiguous. The comments were:
The subject who wrote comment 2 did not understand "these" in the truth conditions as synonymous with "the latter". This comment suggests a need to review the paraphrases and truth conditions for potentially ambiguous pronominal referents.
In 107 cases out of 1175 trials (9%), and in 12 more cases at the end of the study, subjects provided comments. Here is a selection of comments to show their range and quality:
The willingness of some subjects to comment profusely provides valuable hints about their understandings of both the items and the user interface. For example, one subject indicated that the initial state of the task response, a checked radio button to the left of the box containing the alternatives and their radio buttons (see Figure 1), was ambiguous, possibly representing indifference rather than the absence of a response. I had labeled it "No answer" in a previous draft, but pretesting led me to believe that this would encourage its use. If nothing more drastic is done, it should perhaps be relocated to the beginning of the prompt.
Comments by quitting subjects might elucidate their reasons, but only 1 such comment out of 58 trials (2%) was offered: "overly complicated, difficult to follow", from a subject who had performed 7 trials of truth-conditional above paraphrasal tasks. This is consistent with the tendency for this treatment condition to accompany the highest quitting rate and, among those who did not quit, to consume the most time.
The results presented above suggest that people without training and with about 10 minutes of practice can use paraphrasal and/or truth-conditional selection to disambiguate sentences in about 20-25 seconds per ambiguity, achieving approximately 75% inter-task consistency and 75% agreement on the majority alternatives. If we imagine moderate training, incentives, or multiperson collaboration making the performance satisfactory, this speed is (per person) roughly equivalent to 1600 words per hour, which is about six times as fast as typical human translation (PROZ 2006). This estimate suggests that some investment in human disambiguation could indeed be rational if it produced machine-processable expressions of meaning, ready for use in automated question answering, summarization, retrieval, and translation into many target languages.
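One way to arrive at a throughput figure of this size is sketched below. The report does not state how many words of text correspond to one ambiguity; the value of 10 words per ambiguity used here is purely my assumption for illustration, chosen because it brackets the 1600-words-per-hour estimate.

```python
SECONDS_PER_HOUR = 3600

def words_per_hour(seconds_per_ambiguity, words_per_ambiguity):
    """Words of text covered per hour at a given disambiguation speed."""
    return SECONDS_PER_HOUR / seconds_per_ambiguity * words_per_ambiguity

# Hypothetical density of ambiguities; not stated in the report.
ASSUMED_WORDS_PER_AMBIGUITY = 10

low = words_per_hour(25, ASSUMED_WORDS_PER_AMBIGUITY)   # slower end of 20-25 s
high = words_per_hour(20, ASSUMED_WORDS_PER_AMBIGUITY)  # faster end of 20-25 s
print(f"{low:.0f} to {high:.0f} words per hour")
```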
It appears that paraphrasal selection is typically less time-consuming than truth-conditional selection and, surprisingly, that the former, if it precedes the latter, may dramatically increase the latter's speed. This possible preparatory effect merits investigation. It seems inconsistent with the fact that the same task order accompanied almost as many abandonments of participation as the opposite task order. Neither task type was markedly superior to the other in subject acceptance or performance, so their further comparison seems warranted, particularly since the task type seems to interact with sentence features to affect the meaning that people select. Sentences' disambiguation difficulty appeared to covary with some quantification and reference-order features. This association might help in the prediction of the severity of ambiguity and in the optimization of disambiguation methods. Diagnostics might be found predicting which task type (or combination of types) works best for which individuals, and for which sentences.
As people repeatedly disambiguated sentences, they became more efficient, but the quality of their disambiguations did not tend to increase, nor did they report any substantial easing of the task difficulty. Task difficulty was reflected mostly in rates of task abandonment and, among those who persisted, in task-performance durations, which people adjusted as if to achieve uniform degrees of satisfaction and output quality. Thus, a topic for future study is what interventions (incentives, training, etc.) can induce subjects to achieve not only greater speed, but also increasing satisfaction, consistency, and agreement, as they gain experience.
Consistency and agreement rather than objective correctness were the main indicators of performance, but they were dubious measures of quality in this experiment. Subjects performing two tasks per sentence were not trained or asked to choose semantically consistent alternatives, and they could have had goals other than consistency. Given that most of the sentences were selected for their ambiguity and that all sentences were presented without context, I expected substantial inter-subject disagreement. The aggregate pairwise agreement of 67% exceeded chance by one-third of the possible maximum (κ = 0.34), near the low end of the approximate range of κ values (0.3 to 0.6) reported on previous word-sense disambiguation tasks (Chklovski and Mihalcea 2003). But the size of the majority varied greatly among sentences, and in ways that suggest that agreement approaches perfection when ambiguity is low. Future experiments could vary the amount of accompanying discursive context and, with adequate context, could treat some answers as correct and others as incorrect.
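The kappa value cited above follows from the usual chance-corrected agreement formula, as this small sketch shows. It assumes that chance agreement between two subjects choosing between two alternatives is 50%, which is my reading of "one-third of the possible maximum" rather than something the report states explicitly.

```python
def kappa(observed, chance):
    """Chance-corrected agreement: share of the possible gain over chance achieved."""
    return (observed - chance) / (1 - chance)

# 67% observed pairwise agreement; 50% chance agreement is an assumption.
k = kappa(0.67, 0.5)
print(f"kappa = {k:.2f}")
```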
A fundamental distinction between the tasks performed in this experiment and the real-world tasks motivating the research is that subjects in the experiment were acting as readers rather than authors. Subjects needed to guess what sentences mean instead of knowing the meanings that they, the subjects, wanted to express. It is plausible that the experimental tasks were more difficult than real-world author-disambiguation tasks because of this knowledge gap, although it cannot be assumed that authors always have unambiguous meanings in mind. Future research could compare these kinds of disambiguation, especially since disambiguation by readers, like abstracting and indexing by persons other than authors, is also a real-world task. Such research would need to confront the problem of discovering what meanings authors wish to express and how they express those meanings, without using the latter as a proxy for the former. Or, if the technique is putting subjects into situations where they have only one plausible meaning to express, the research would need to avoid putting the words into subjects' mouths as the means by which to motivate them to express that meaning. Such experiments have been conducted on collaborative task performance, using spatial problem-solving or puzzle tasks (e.g., Gergle et al. 2006), but extending this paradigm to the range of meanings characteristic of the Web would be a challenge.
As expected in a pilot study, the discoveries I describe here are compromised by the small size, nonrandom selection, and unverified characteristics of the subject sample. Moreover, two other possible contaminants are worth mentioning. One is differential attrition. Subjects were randomly assigned to treatment groups, but then each subject decided whether to complete the study or quit early. Subjects in 2-task treatment groups quit more often than subjects in 1-task treatment groups. Perhaps a difficult task induces the least skilled subjects to quit (in frustration), and an easy task induces the most skilled subjects to quit (in boredom). If so, the randomization of treatment assignments may not guarantee the statistical independence of subject characteristics from treatments, no matter how large and representative the sample becomes. The other contaminant is the numbers that I chose to use in formulating the truth conditions. Subjects presumably paid some attention to the realism of those numbers as they chose between the alternatives, and there is presumably no way to choose neutral numbers for such situation descriptions. The arbitrariness of numeric truth conditions may be an inescapable problem, whether they are formulated by persons or generated by algorithms.
Amazon.com, "Amazon Mechanical Turk" (Web site), 2007; http://www.mturk.com/mturk/welcome.
Jennifer E. Arnold, Thomas Wasow, Ash Asudeh, and Peter Alrenga, "Avoiding Attachment Ambiguities: The Role of Constituent Ordering", Journal of Memory and Language, 51, 2004, 55-70; http://www-csli.stanford.edu/~wasow/AWAA_final.pdf.
Kent Bach, "Ambiguity", Routledge Encyclopedia of Philosophy (London: Routledge, 1998).
Mark C. Baker, "On the Absence of Certain Quantifiers in Mohawk", ch. 2 in Quantification in Natural Languages, ed. E. Bach et al. (Amsterdam: Kluwer Academic Publishers, 1995), pp. 21-56.
Hanoch Ben-Yami, Logic & Natural Language: On Plural Reference and its Semantic and Logical Significance (London: Ashgate, 2004).
Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web", Scientific American, 284(5), 2001, 34-43; http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2.
Arendse Bernth, "EasyEnglishAnalyzer: Taking Controlled Language from Sentence to Discourse Level", CLAW 2006, 2006; http://www.mt-archive.info/CLAW-2006-Bernth.pdf.
Maria Bittner, "Quantification in Eskimo: A Challenge for Compositional Semantics", ch. 3 in Quantification in Natural Languages, ed. E. Bach et al. (Amsterdam: Kluwer Academic Publishers, 1995), pp. 59-80.
Daniel M. Berry, Erik Kamsties, and Michael M. Krieger, "From Contract Drafting to Software Specification: Linguistic Sources of Ambiguity--A Handbook", manuscript, 2003; http://se.uwaterloo.ca/~dberry/handbook/ambiguityHandbook.pdf.
Hervé Blanchon, "Interactive Disambiguation of Natural Language Input: A Methodology and Two Implementations for French and English", IJCAI 97, 1997; http://www-clips.imag.fr/geta/User/herve.blanchon/Docs/IJCAI-97.pdf.
Mariano Ceccato, Nadzeya Kiyavitskaya, Nicola Zeni, Luisa Mich, and Daniel M. Berry, "Ambiguity Identification and Measurement in Natural Language Texts", University of Trento, Department of Information and Communication Technology, Technical Report DIT-04-111, 2004; http://eprints.biblio.unitn.it/archive/00000727/01/Report_v26_DIT_04_111.pdf.
Francis Chantree, "Ambiguity Management in Natural Language Generation", manuscript, 2003; http://www.cs.bham.ac.uk/~mgl/cluk/papers/chantree.pdf.
Timothy Chklovski, "Collecting Paraphrase Corpora from Volunteer Contributors", K-CAP 2005, 2005; http://www.isi.edu/~timc/papers/KCAP05-Chklovski-paraphrases.pdf.
Timothy Chklovski and Rada Mihalcea, "Exploiting Agreement and Disagreement of Human Annotators for Word Sense Disambiguation", RANLP 2003, 2003; http://www.isi.edu/~timc/papers/ranlp2003.pdf.
Peter Clark, Phil Harrison, Tom Jenkins, John Thompson, and Rick Wojcik, "Acquiring and Using World Knowledge Using a Restricted Subset of English", FLAIRS 2005, 2005; http://www.cs.utexas.edu/users/pclark/papers/flairs.pdf.
Ann Copestake, Dan Flickinger, Ivan A. Sag, and Carl Pollard, "Minimal Recursion Semantics: An Introduction", Research on Language and Computation, 3, 2005, 281-332; http://www.cl.cam.ac.uk/~aac10/papers/mrs.pdf.
Leda Cosmides and John Tooby, "Are Humans Good Intuitive Statisticians after All? Rethinking Some Conclusions from the Literature on Judgment under Uncertainty", Cognition, 58, 1996, 1-73; http://www.linguistics.pomona.edu/LGCS121Spring2005/Reading/CosmidesTooby1996.pdf.
Markus Egg, Alexander Koller, and Joachim Niehren, "The Constraint Language for Lambda Structures", Journal of Logic, Language and Information, 10, 2001, 457-485; http://scidok.sulb.uni-saarland.de/volltexte/2004/293/pdf/clls2000.pdf.
Nick Evans, "A-Quantifiers and Scope in Mayali", ch. 8 in Quantification in Natural Languages, ed. E. Bach et al. (Amsterdam: Kluwer Academic Publishers, 1995), pp. 207-270.
Leonard M. Faltz, "Towards a Typology of Natural Logic", ch. 9 in Quantification in Natural Languages, ed. E. Bach et al. (Amsterdam: Kluwer Academic Publishers, 1995), pp. 271-319.
Norbert E. Fuchs, Kaarel Kaljurand, and Gerold Schneider, "Attempto Controlled English Meets the Challenges of Knowledge Representation, Reasoning, Interoperability and User Interfaces", FLAIRS 2006, 2006; http://www.ifi.unizh.ch/attempto/publications/papers/FLAIRS0601FuchsN.pdf.
Darren Gergle, Robert E. Kraut, and Susan R. Fussell, "The Impact of Delayed Visual Feedback on Collaborative Performance", CHI 2006, 2006; http://www.soc.northwestern.edu/dgergle/pdf/Gergle_CHI2006.pdf.
William L. Hays, Statistics for the Social Sciences, Second Edition (New York: Holt, 1973).
George Lakoff, "The Contemporary Theory of Metaphor", in Metaphor and Thought, 2nd edn., ed. Andrew Ortony (Cambridge: Cambridge University Press, 1992); http://www.ocf.berkeley.edu/~katclark/coganthrodecal/lakoff.pdf.
Liang Y. Liu, "Executive Summary: 1994 Texas School Survey of Substance Use Among Students: Grades 4-6", 1994; http://www.tcada.state.tx.us/research/survey/grades4-6/1994/execsumm.html.
Jaap Maat, Philosophical Languages in the Seventeenth Century: Dalgarno, Wilkins, Leibniz (Amsterdam: Institute for Logic, Language and Computation, Universiteit van Amsterdam, 1999).
Catherine C. Marshall and Frank M. Shipman, "Which Semantic Web?", Hypertext '03 Proceedings, 2003; http://www.csdl.tamu.edu/~marshall/ht03-sw-4.pdf.
National Public Radio, news broadcast, 31 August 2006.
United States National Library of Medicine and National Institutes of Health, "Itching", in Medical Encyclopedia (Web site), 2007; http://www.nlm.nih.gov/medlineplus/ency/article/003217.htm.
Natalya F. Noy and Deborah L. McGuinness, "Ontology Development 101: A Guide to Creating Your First Ontology", Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, 2001; http://smi-web.stanford.edu/auslese/smi-web/reports/SMI-2001-0880.pdf.
Patricia T. O'Conner, Woe is I: The Grammarphobe's Guide to Better English in Plain English (New York: Riverhead, 1996).
Barbara H. Partee, "Compositionality", in Varieties of Formal Semantics, ed. F. Landman and F. Veltman (Dordrecht: Foris, 1984), pp. 281-311.
Barbara H. Partee, "Quantificational Structures and Compositionality", ch. 16 in Quantification in Natural Languages, ed. E. Bach et al. (Amsterdam: Kluwer Academic Publishers, 1995), pp. 541-601.
Barbara H. Partee with Herman L.W. Hendriks, "Montague Grammar", ch. 1 in Handbook of Logic and Language, ed. Johan van Benthem and Alice ter Meulen (Amsterdam: Elsevier Science, 1997), pp. 5-91.
Barbara H. Partee, "Noun Phrase Interpretation and Type-Shifting Principles", ch. 10 in Studies in Discourse Representation Theory and the Theory of Generalized Quantifiers, ed. J. Groenendijk, D. de Jong, and M. Stokhof (Dordrecht: Foris, 1986), pp. 203-229.
Karen Petronio, "Bare Noun Phrases, Verbs and Quantification in ASL", ch. 17 in Quantification in Natural Languages, ed. E. Bach et al. (Amsterdam: Kluwer Academic Publishers, 1995), pp. 603-618.
Paul Pietroski, "Logical Form", The Stanford Encyclopedia of Philosophy (Spring 2007 Edition), ed. Edward N. Zalta (Web site), 2007; http://plato.stanford.edu/archives/spr2007/entries/logical-form/.
Jonathan Pool, "Can Controlled Languages Scale to the Web?", CLAW 2006, 2006; http://turing.cs.washington.edu/papers/pool-clweb.pdf.
Richard Power, "Controlling Logical Scope in Text Generation", ENLG 1999, 1999; http://www.itri.brighton.ac.uk/~Richard.Power/ewnlg-drt.ps.
Richard Power and Roger Evans, "WYSIWYM with Wider Coverage", ACL 2004, 2004; http://www.itri.brighton.ac.uk/~Richard.Power/acl04.pdf.
PROZ: The Translators Workplace, "What Is the Realistic Translation Speed?" (Web discussion), 2006; http://www.proz.com/topic/40966.
Marcus Sammer, Kobi Reiter, Stephen Soderland, Katrin Kirchhoff, and Oren Etzioni, "Ambiguity Reduction for Machine Translation: Human-Computer Collaboration", AMTA 2006; http://turing.cs.washington.edu/papers/CLMT-AMTA-2006.pdf.
Rolf Schwitter, "Controlled Natural Language as Interface Language to the Semantic Web", IICAI-05, 2005; http://www.ics.mq.edu.au/~rolfs/papers/IICAI-schwitter-2005.pdf.
Clay Shirky, "The Semantic Web, Syllogism, and Worldview", manuscript, 2003; http://www.shirky.com/writings/semantic_syllogism.html.
Arturo Trujillo, Translation Engines: Techniques for Machine Translation (London: Springer, 1999).
Marcia Damaso Vieira, "The Expression of Quantificational Notions in Asurini do Trocará: Evidence Against the Universality of Determiner Quantification", ch. 20 in Quantification in Natural Languages, ed. E. Bach et al. (Amsterdam: Kluwer Academic Publishers, 1995), pp. 701-720.
Luis von Ahn and Laura Dabbish, "Labeling Images with a Computer Game", CHI 2004, 2004; http://www.cs.cmu.edu/~biglou/ESP.pdf.
Kai von Fintel and Sabine Iatridou, "Anatomy of a Modal Construction", manuscript, 2006; http://web.mit.edu/fintel/anatomy.pdf.
Thomas Wasow, Amy Perfors, and David Beaver, "The Puzzle of Ambiguity", in Morphology and the Web of Grammar: Essays in Memory of Steven G. Lapointe, ed. O. Orgun and P. Sells (Stanford: CSLI Publications, 2005); http://montague.stanford.edu/~dib/Publications/lapointe_paper_9-4.pdf.
Dag Westerståhl, "Generalized Quantifiers", The Stanford Encyclopedia of Philosophy (Winter 2005 Edition), ed. Edward N. Zalta (Web site), 2005; http://plato.stanford.edu/archives/win2005/entries/generalized-quantifiers/.
Edward N. Zalta, "Gottlob Frege", The Stanford Encyclopedia of Philosophy (Spring 2007 Edition), ed. Edward N. Zalta (Web site), 2005; http://plato.stanford.edu/archives/spr2007/entries/frege/.
I gratefully acknowledge valuable comments by Susan M. Colowick and Emily Bender on previous versions of this report.