Breakout Session: Computers Hearing and Seeing

Moderated by Scott Gardner, General Manager and EIC, Embedded Vision Alliance

Featuring Bruce Kleinman, Corporate VP, Platform Marketing, Xilinx; Gary Clayton, Chief Creative Officer, Nuance Communications; and Marc Tremblay, Distinguished Engineer, Microsoft

Discussion: User-interfaces – are these ‘science fiction interfaces’ or are they actually practical?

Is there a synergy between visual and audio technology?

Once it becomes cloud based it is likely to become more powerful. This is the Holy Grail and everyone wants to go there. Nuance is attempting to move towards mining the intent of the user as opposed to just the instructions given by the user.

3 primary uses for speech

-Command and control



These are the 3 ways that speech is typically thought to fit but there seems to be a growing demand for security in speech.

Are there specific segmentations for where and what types of technology are used in specific circumstances, e.g. what would you use in the living room, in the office, in the bedroom?

– Multi-modality options will be required for various locations and uses. Intent is a key factor in this when building a personal profile of the user. It should be looking at individual needs, social needs etc.

The metric of merit is not the accuracy of the engine, rather the satisfaction of the customer.

Can we do some of this data mining to personalize it to the customer instead of just mass marketing media?

This is already happening. Search is a big part of what we are doing already.

Searching is a broken term. I want to ‘find’. If you are overly filtered you may end up in the wrong place. These are all fascinating pieces, but they need to be separated otherwise the naturalness of the interaction falls apart. Context plays a key role in a more accurate search.

How will the system naturally identify which device you are trying to talk to if you have 3 devices in the same room?


Where is the next-wave of usage applications going in this space?

You have to break it apart. Obvious things include things like voicemail to text. A lot of the rest turns on the movement from natural language and natural language understanding that applies to all kinds of data. Speech becomes a subset for the things that we saw for natural language understanding in 2001.

Every major corporation is using it for extracting and interpreting data. Customer service issues and problems will be huge as we move away from the primitive voice services available now. Also healthcare. For Nuance, our big areas are customer service and health care.

Are we heading for a cliff, though? Technology has reached a point where my productivity is going backwards. I don’t see a lot of attention being paid to how we usefully absorb all of this. We are becoming beholden to all the. I believe the killer app will be a true digital assistant that takes all the information and handles it for me. I can see corporations funding that, and individuals paying to purchase a service like this.

Take the sociology of that further. The systems will never be perfect. The systems aren’t about how you incorporate a human being into them. There needs to be an engineered-in failover to a human being to ensure that if the system fails, you get a human answer.
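As a rough illustration of the failover idea, here is a minimal sketch in Python. Everything in it is hypothetical: the function name `answer_with_failover`, the `engine` and `ask_human` callables, and the 0.7 confidence threshold are assumptions made for the example, not anything described in the session.

```python
def answer_with_failover(query, engine, ask_human, min_confidence=0.7):
    """Return (answer, source), handing the query to a person when the machine fails.

    engine:    hypothetical callable returning (answer_text, confidence in 0..1)
    ask_human: hypothetical callable that routes the query to a human operator
    """
    try:
        text, confidence = engine(query)
    except Exception:
        # The automated system broke outright: fail over to a human.
        return ask_human(query), "human"
    if confidence < min_confidence:
        # The system answered but is not sure enough: fail over as well.
        return ask_human(query), "human"
    return text, "machine"
```

The point of the sketch is that the human path is part of the design from the start, covering both outright failures and low-confidence answers, rather than being added after the fact.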

We have assumed that the world looks like Google, which is interesting and scary at the same time. We assume all the world’s info is in one place. The digital assistant can begin to pick and choose the information that you want and will use, creating a new perspective on this. Machines are better at that than humans, so this is where the promise lies.

The kind of thing a digital assistant would do is sort and delete email based on my regular activity, i.e. how often I delete, reply, file etc., and only present me with what is relevant.
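A minimal sketch of that kind of behavior-based email triage, in Python, might look like the following. The data layout (a list of `(sender, action)` pairs), the function names, and the thresholds `min_seen` and `delete_share` are all assumptions made for illustration; no particular product or algorithm was described in the session.

```python
from collections import Counter

def action_counts(history):
    """Tally past actions per sender.

    history: list of (sender, action) pairs, with action one of
    "delete", "reply" or "file" (a hypothetical log format).
    """
    per_sender = {}
    for sender, action in history:
        per_sender.setdefault(sender, Counter())[action] += 1
    return per_sender

def triage(sender, per_sender, min_seen=3, delete_share=0.8):
    """Return "hide" for senders whose mail is almost always deleted, else "show"."""
    counts = per_sender.get(sender)
    if counts is None:
        return "show"  # unknown sender: never hide
    total = sum(counts.values())
    if total < min_seen:
        return "show"  # too little history to judge
    if counts["delete"] / total >= delete_share:
        return "hide"
    return "show"
```

Defaulting to "show" whenever there is little or no history keeps the assistant from hiding mail it has no basis to judge.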

I think discussing email is a waste of time. Every single person today has discussed how their kids use technology.


Let’s bring it back to the sight and sound context.

A lot of it comes down to the speed of communication. There is a mismatch between communication from one person to another. Is there a way to convey a message using visual and speech technology to quickly communicate between people?

When we say natural, we put all the effort on the machine, to translate what was never meant to be conveyed. You are taking all the imperfections from the human element and putting them on the machine. At what point do you need to learn something entirely different?


Is there another area, other than medical and services, in which you can monetize the technology? What’s the next big thing?

Something Americans aren’t as tuned into as people in the UK are is video camera monitoring. It’s a Big Brother kind of thing. There is a big business now in surveillance. Changing demographics in society, such as an aging population, result in a smaller young population to fill certain lower-paying jobs.

One of the markets for this population is computer care for the elderly instead of human care. If it is all automated it monitors activity for the aged. It is less intrusive. GE already does this.

No one mentioned translation. In the context of international business and understanding the nuances of foreign languages, this can be hugely powerful in business.

What about the integration of this into robotics? Mundane tasks in a factory…


There have been niche applications forever, but is there a reason why the technology base has been so fragmented?  Is there a unifying, broad opportunity in vision?

It’s formative. There needs to be a critical mass to provide a venue for people to exchange ideas so that we don’t go through an extended period of fragmentation. We don’t want to invent the same thing 50 times. We are in the formative phase.

To go back to the aging population: This could well be the ‘killer’ app. The Japanese are really focusing on the personal robot. They have a significantly aging population that lives for a long time. They are pouring huge amounts of money into personal robots and this concept. Right now a robot is more expensive than a human, but as the cost goes down, a robot can provide a comfort factor, which comes down to trust.