Languages of India

I had the opportunity at the SIGIR conference to attend a tutorial on Languages of India (for Information Retrieval).  This map is a great illustration of not just the language diversity but also the diversity of scripts used in the various languages.

There are 22 languages recognized by the Constitution of India, though there are 29 languages spoken by more than a million native speakers.

According to the Internet World Stats figures quoted in the tutorial, about 60 million of India’s 1.1 billion people are online.  That’s only just over 5% penetration, though growth in internet usage is put at 1100% (2000-2007).
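As a quick sanity check on those numbers, here is a minimal sketch of the arithmetic. The inputs are the rounded figures quoted above; the variable names are my own, and reading "1100% growth" as "the 2007 figure is 12 times the 2000 figure" is an assumption about how the statistic is defined.

```python
# Rounded figures as quoted in the post (circa 2007).
users_online = 60_000_000
population = 1_100_000_000
growth_pct = 1100  # reported growth in users, 2000-2007

# Share of the population that is online, as a percentage.
penetration_pct = 100 * users_online / population

# Assumption: 1100% growth means users rose by 11x the 2000 base,
# so the 2007 figure is 12 times the 2000 figure.
implied_users_2000 = users_online / (1 + growth_pct / 100)

print(f"Penetration: {penetration_pct:.1f}%")
print(f"Implied users in 2000: {implied_users_2000 / 1e6:.0f} million")
```

On that reading, the growth figure implies a base of only a few million users in 2000, which is why such a large percentage still leaves penetration so low.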

The CLIA project funded by the Department of Information Technology in India brings together a consortium of 11 institutes to promote online information access in Indian languages.  Their Forum for Information Retrieval Evaluation (FIRE) will be putting in place an infrastructure and a set of resources to support the development and evaluation of information access in Hindi, Bangla, Marathi, Tamil, Telugu, Punjabi and Malayalam (and English).

I very much enjoyed the tutorial (lots of information about morphology and syntax of the various languages that I won’t go into here) and I look forward to seeing the success of the FIRE initiative.

The E.U. – a ‘Geographic Expression’?

Shortly after the Irish referendum on the Lisbon Treaty, a visiting professor from the University of Maryland remarked over lunch on his fascination with watching the process unfold, as it reminded him so much of the history of the United States.  True, the parallels are there when you look at the Articles of Confederation and the statement that ‘under the Articles (and the succeeding Constitution) the states retained sovereignty over all governmental functions not specifically deputed to the central government’, and at the fact that some of the concern around the Articles related to the contention that they ‘did not strike the right balance between large and small states in the legislative decision making process’.  [Compare this to the statement that, “The Treaty of Lisbon will be an international treaty agreed and ratified by sovereign Member States that agree to share some of their sovereignty in supranational cooperation”.]

On the other hand though, and returning to ‘The Post-American World’,  I was taken by Zakaria’s description of India as “A Geographic Expression” when examining the role of regional politics in India.

“All politics is local”… In India, that principle can be carved in stone.  India’s elections are not really national elections at all.  They are rather simultaneous regional and local elections that have no common theme.

India’s diversity is four thousand years old and deeply rooted in culture, language, and tradition.  This is a country with seventeen languages and 22,000 dialects that was for centuries a collection of hundreds of separate principalities, kingdoms and states….

…The Hindu-Muslim divide might be crucially important in one set of states, but it is absent in others.  Political leaders who are strong in Tamil Nadu have no following whatsoever in the north.  Punjab has its own distinct political culture that relates to Sikh issues and the history of Hindu-Sikh relations.  Politicians from Rajasthan have no appeal in Karnataka.  They cannot speak each other’s language – literally.  It would be like holding elections across Europe and trying to talk about the same issues with voters in Poland, Greece, France, and Ireland. Winston Churchill once said that India was “just a geographic term, with no more political personality than Europe”.

And so the difficulty of moving the European Union along…

Unfortunate Tagline

I picked up a free SearchMe T-shirt yesterday.  The tagline for the search engine on the back of the T-shirt was, “You’ll know it when you see it”. Unfortunately, this brought to mind for me the attempts of the US Supreme Court to arrive at a definition of obscenity and the quote, “I know it when I see it”. Uh, so exactly what kind of information is this search engine helping me find?

On visiting the search engine though, I learned two things:

1. This tagline no longer appears on the site, replaced by “Find. Organize. Share”.  I see from screenshots at an earlier review of the site though, that the tagline used to be there.  I can see why the original tagline seemed appropriate, since the site bills itself as ‘visual search’, but I wonder if somebody copped on to the reference and decided it may not be the best tagline after all.

2. The SearchMe interface is really interesting.  I liked the visual navigation of result pages (though with the text list at the bottom rather than visual only).  I naturally used my mouse scroll wheel, though I found it scrolled the documents two-at-a-time, so resorted to using the keyboard arrow keys.  I was just using it in ‘test mode’ last night – I’ll need to go back there when I have actual searches to carry out and see how it really works (particularly the ‘categories’ it uses).  But the interface is interesting…. worth a look!

Google in China (and Africa)

I’ve been attending the SIGIR’08 conference in Singapore this week.  A very good keynote address was given by Dr. Kai-Fu Lee of Google entitled, “Delighting Chinese Users: The Google China Experience”.  The talk was about how Google, over the past two and a half years or so, has been trying to win over Chinese users; about being humble about users and not presumptuous; about how it’s so much more than just localisation of existing products into Chinese language, but rather understanding the fundamental differences of Chinese users and their use of the internet and developing wholly new features or products for the market.

Some of the differences he highlighted with respect to the Chinese market are that users are younger (average 25 versus 45 in the US), they access the internet from internet cafes or mobile devices (not home computers) and new users are coming online at a growth rate of 35% – so users are typically ‘new’ to the internet rather than seasoned.  From a language perspective, the challenge is the cumbersome entry of Chinese text (one of Google’s innovations he discussed is a new input method editor), which means users prefer click-based browsing to query reformulation (though he mentioned click-based browsing may also be partly attributable to their tendency toward wanting to ‘learn all about’ a given topic rather than navigate to a particular place).  There is also an advantage in the density of Chinese text – much more information is conveyed in fewer characters, so you can show more information in a given area.

Also related to language, he observed that Chinese searchers often find that the information they are looking for (e.g. medical information) is not available in Chinese, so Google machine translation from English into Chinese is an important part of the offering.

All in all a very informative and entertaining talk (great ‘Chinglish’ examples in comparing MT to human translation).

From this article at the New York Times, I see the same approach being adopted by Google in Africa.

“Africa is a huge long-term market for us,” Eric E. Schmidt, Google’s chief executive, said by e-mail. “We have to start by helping people get online, and the creativity of the people will take care of the rest.”….

….“A lot of people assume Google is trying to replicate in Africa what it has done elsewhere,” adds Mr. Kiagiri, who transferred last year from Google’s head office in California. “Sure, we want to bring existing products into this market. But we also want to organize information locally in a way we haven’t done elsewhere.”

This is the approach that Kai-Fu Lee outlined in China: hire local engineers and try to really understand the local situation (e.g. in Africa, mobile apps for low-end phones) and “empower local flexibility”.

[As an aside, Kai-Fu also mentioned that the name ‘Google’ presents something of a challenge to Chinese speakers and, while not as bad as some other examples, they’re taking to the use of as their domain in China.]

Finally, on a related note, I see that a post on Google hitting 40 languages has been added over at the Google blog.

The Post-American World

On the recommendation of Fred Wilson,  I’ve just finished reading The Post-American World by Fareed Zakaria.  I really enjoyed the book but I’ve also really tripped across the uneven or incoherent treatment of Europe, specifically the European Union, particularly in the forward-looking (“Post-American”) part of the book.  It’s going to take more than one post for me to work through my thoughts on this, but as I was reading through the book I really expected to find a chapter dedicated to the European Union just like China (‘The Challenger’) and India (‘The Ally’) as a large force in the new multipolar order.

Working through the references to the European Union is instructive.  The first is on page 4, and it’s useful because it’s included in the statement that, “Functions that were once controlled by governments are now shared with international bodies like the World Trade Organization and the European Union” (so that sets out his view of the EU).

On page 43, “The European Union now represents the largest trade bloc on the globe, creating bipolarity…”, so there is the acknowledgment of the EU as an economic power.  Some other references to economic statistics use the ‘Eurozone’ or ‘Europe’ rather than the EU, like on p195 (“The Eurozone has been growing at an impressive clip, about the same pace per capita as the United States since 2000.  It takes in half the world’s foreign investment, boasts labor productivity often as strong as that of the United States, and posted a $30 billion trade surplus in 2007 from January through October…. All in all, Europe presents the most significant short-term challenge to the United States in the economic realm”). Related to growth of financial stock on p204, “…the Eurozone’s is outpacing America’s which clips along at 6.5 percent.  Europe’s total banking and trading revenues, $98 billion in 2005 have nearly pulled equal to U.S. revenues of $109 billion.”

In terms of international relations, on p125, “Were the United States and the European Union to adopt fundamentally differing attitudes toward the rise of China, for example, it would put permanent strains on the Western alliance that would make the tensions over Iraq look like a minor spat.” and on p216 an observation on the fact that the EU did not take on the role of resolving the Parsley crisis.  On page 207, “Even on immigration, the European Union is creating a new ‘blue card’ to attract highly skilled workers from developing countries”.

So I disagree with Mr. Zakaria’s decision not to specifically examine in a coherent way the EU’s role in the Post-American World – it deserves its own chapter!  If we go back to the first reference in the book that distinguishes between ‘governments’ and ‘international bodies’ then it’s certainly worth looking at the move toward an increasing ‘government’ role of the EU, for example in the Treaty of Lisbon and how that plays into the EU’s ability to take a more important role in a multipolar world, not just economically (which it clearly already does) but also from the point of view of international relations and politics.

There’s also the question of course of whether the EU will (or even wants to) continue to move (and how far) in that direction of increasing the role of the EU ‘government’ or not, and to what extent the trends outlined in this book play into that (driving the need for a ‘strong’ European Union in a multipolar world).  This is all very topical given our recent rejection of the Lisbon Treaty in the referendum!

I did enjoy the book – it definitely gave me a lot to think about.

Things I’ll Miss – #1: Fireflies

I numbered the title of this post #1 since I think it’s something I’ll return to as I notice things that I miss from the US in settling back in Ireland.  I just returned from probably (hopefully) my last week ‘at home’ in Syracuse (‘hopefully’ because closing on our house there should happen soon).  My wife and I enjoyed several evenings sitting out on our deck and I really noticed the fireflies.  People have been commenting that they haven’t been as plentiful in recent years, but we enjoyed some lovely displays on several evenings.

I was explaining to my wife something I had read, or watched in a documentary at some point: that in certain kinds of fireflies the female ‘fakes’ the glow (glowing is part of the mating process) of another kind of firefly solely to attract and eat unsuspecting males.  I found the following in the Wikipedia entry on fireflies:

Female Photuris fireflies are known for mimicking the mating flashes of other fireflies for the sole purpose of predation. Target males are attracted to what appears to be a suitable mate, and are then eaten. For this reason the Photuris female is sometimes referred to as “femme fatale”.

At our ‘going away’ party held in the garden of our friends’ house the fireflies were again out in force and somebody raised the question, “Hey, is it true that they don’t have fireflies in Ireland?”.  Yup it’s true.  This also led to more firefly discussion.  From this I also learned that if you ‘squish’ a firefly, say on your arm, there will be a residue left that glows for a while – and this was duly demonstrated.

I enjoyed watching the fireflies on my visit!

Geeks and Shrinks

We’re recruiting for an IP Manager at the CNGL and it’s proving a tricky enough task. I’m reminded of a book I read some years ago, “The Future of Success” by Robert B. Reich (whence the title of this post). One of the things I took away from the book was the value of people who really understand technology, but also have a good appreciation for market needs, product design, etc. To quote from one of the Amazon reviews:

Reich develops great metaphors to describe working people in few words. One of them is the Geeks and the Shrinks. The Geeks are the ones who know how to gather and manipulate data so as to develop new products and services. The Shrinks are the ones who research and understand what consumers really want through market research, focus groups, and other tools. The Geeks and Shrinks are like the Yin and Yang of this new business world. They both need each other to create new markets of products and services.

It’s that Yin/Yang of Geek/Shrink that we’re looking for. Somebody who can really understand the technology and research being conducted at the Centre (Language Technologies, Digital Content Management, and Localisation) while also having an eye for what may be valuable in the market, worth protecting etc.

Of course we understand that this mix is fairly rare (particularly when you focus in on specific technologies). We’re taking the view that ‘more Geek’ is better in that somebody with the right technology background and desire for this kind of role can work into the role more easily than somebody who doesn’t think being called a ‘geek’ is a compliment!

If you think you know somebody who might fit the bill… send them our way!

Yahoo! Exodus – Wow!

While I was reading about the Powerset acquisition over at Techcrunch, I also came across this list of employees who have left Yahoo! It’s not so much the length of the list (Yahoo! is a big company) as the employees who have left most recently – over the past couple of months; June/July 2008 – that really astounded me: founders of Flickr and, Jeremy Zawodny (whose blog I’ve followed on and off for quite some time), Qi Lu, etc. – really significant. I know there’s a ton of coverage/debate of the whole Yahoo! Microsoft thing over at Techcrunch (among other places) and I’m not getting into that… to me this list gives a better indication of what’s going on at Yahoo! than the media bluster – and it doesn’t look good.

The Powerset Acquisition

I have a particular interest in the use of Language Technologies in the Search arena.  The Microsoft acquisition of Powerset for about $100M is therefore noteworthy.  Powerset is invariably described as a ‘Semantic Search’ engine (within Linguistics, Semantics relates to the study of meaning), though Powerset states the goal as, “to change the way people interact with technology by enabling computers to understand our language”.

While it’s great to see Language Technology get some limelight (particularly in the search space where it is often dismissed), I must say I’ve been underwhelmed each time I’ve interacted with Powerset’s various demos and systems along the way.  It’s difficult to demonstrate, in general, the improvements from using this technology in search.  First, Powerset has been restricted to searching Wikipedia rather than a large Web index.  Secondly, some (many) queries just won’t benefit very much from what Powerset is doing compared to the results of other engines (while some queries will benefit a lot and Powerset will return impressive results).  I think Microsoft will be able to help with both of these.  The scaling of the Powerset indexing to web content has been addressed in much of the coverage.  I suspect that another part of it though will be determining which search queries to Microsoft will benefit most from using Powerset’s approach.  It will be interesting to see where Powerset shows up at Microsoft search.

I’m glad to see that Powerset took this course.  When people asked me what I thought of Powerset I found myself comparing them to Whizbang! (from some years back, where Barney Pell was also involved).  They also had great technology and an exceptional team of people but I never thought they had the product/market strategy to make it (and timing didn’t help them either).  I didn’t think Powerset could make it in search on their own either, so I’m glad to see them team up with Microsoft.

Finally, I hadn’t known this before moving back, but some of the folks at the National Centre for Language Technology (NCLT) here at DCU have particular expertise in this area.  I’m not going to go into Parsing and LFG here, but some information on the technology Powerset licensed from Xerox Parc can be found at Parc Research.  Interestingly, Enterprise Ireland has recently funded researchers at the NCLT to further develop similar (they claim it’s better!) parsing technology for potential application in search…

University Research – how things have changed.

One of the local blogs I’ve recently discovered and have thoroughly enjoyed reading is written by the president of our own university over at UniversityDiary. A recent post entitled “University Research – What is the Agenda” caught my attention since it speaks directly to my own experience.

The post opens, “University research has really only been a serious activity in Ireland since the late 1990s.” How true that is. I completed my undergrad at DCU in 1989 and then an M.Sc. by research in 1991 that was funded through one of the very first European Framework funding programs (I think it was ESPRIT; the project was called SIMPR). After my M.Sc. I stayed at DCU for a while to work on another European-funded project in Machine Translation (Eurotra). But apart from the European funding programs at the time, there was no serious source of research funding for University research as far as I recall – nothing from within Ireland.

How things have changed! The significant commitment to research that I’ve seen over the past few years was a very big part of the attraction back to Ireland. It caught my attention while working in New York. I’ve been following the developments and announcements from Science Foundation Ireland for a number of years now. It’s impressive stuff. What strikes me particularly is the cross-agency collaboration I’ve seen go into a number of the research programs. My work now is within a CSET (Centre for Science Engineering and Technology) funded by SFI, but it’s clear that there has been significant co-operation across SFI, IDA and Enterprise Ireland in bringing these kinds of research programs to fruition. The fact that the Irish Government at a high level has set out an entire National Development Plan with a strategy for scientific research and then followed through to deliver on that with, by the way, collaboration across agencies; that impresses me (think of this as through the eyes of somebody who left with that snapshot of research in Ireland in the early 90’s).

Ferdinand does raise some valid issues related to the current research environment (I can attest to the contractual conditions of researchers here) and there should be some focus on these, but things certainly have come a LONG way. I wholeheartedly hope that the government, while it looks for ‘savings’ in the current economic climate, stays the course on its commitment to research funding – it’s planting the seeds for the future.

