Abstract
Motivated by a dream of building systems that could interpret speech in any language and enable international communication, I started my career at a time when human interfaces and speech, language, and vision processing were primitive, rule-based, and heuristic. Programming linguistic rules into machines was fun and adequate for building the first successful speech synthesis systems (my first “AI” project as a student at MIT in 1978). But this approach to AI proved woefully inadequate for handling the ambiguities of the real world that successful machine perception, translation, and other important AI tasks must confront. How could we possibly encode all the facts and knowledge in the world by introspection and programming? At Carnegie Mellon (where I did my PhD on speech recognition), I was naturally drawn toward early machine learning. Perceptrons, HMMs, stochastic models, and other methods offered solutions, but they were still static classifiers that had to be trained on carefully labeled data and fed explicit knowledge. Neural nets, and particularly backpropagation, were simple yet could learn complex non-linear classifiers. They also offered the fascinating ability to develop hidden knowledge as part of their training. But they were still static pattern classifiers and had to be trained on well-labeled, pre-segmented data, a requirement I knew was unrealistic and problematic, as segmentation and sequencing were problems in themselves. To make neural nets practical for speech and vision, we needed independence from segmentation: we needed shift invariance and the ability to handle sequences. While at ATR in Japan, I set out to develop a shift-invariant neural network, which we called the Time-Delay Neural Network. It was surprisingly successful: the TDNN delivered excellent performance, classified patterns shift-invariantly (without segmentation), and, as a by-product, learned the acoustic-phonetic features that researchers had previously attempted to discover by introspection and to program laboriously into AI systems. The first “convolutional neural network” was born.
In 1987, however, despite our early excitement, TDNNs (aka “CNNs”) did not find broad adoption for practical AI. Alternative approaches (e.g., HMMs), given appropriate tricks and design, could offer equivalent performance at much lower computational cost, and thus NNs were broadly derided by the research community during the 1990s as something akin to a cult. Still, the benefit of NNs, learning implicit knowledge automatically and merging it with hidden knowledge from other tasks, kept us going, and we proposed early NN-based large-vocabulary speech recognizers, face recognition and tracking, lipreading, handwriting recognition, multimodal fusion, cross-modal repair, machine translators, and many more. They led us to develop practical AI systems and to build more than 10 successful startups.
In this talk, I will review our early neural systems, early insights, and lessons learned for science in a practical world. I will also discuss our current research and the way forward.
Bio
Alexander Waibel is Professor of Computer Science at Carnegie Mellon University (USA) and at the Karlsruhe Institute of Technology (Germany). He is director of the International Center for Advanced Communication Technologies. Waibel is known for his work on AI, machine learning, multimodal interfaces, and speech translation systems. He developed the first consecutive speech translation system in 1991 and the first simultaneous one in 2005. Waibel proposed early neural network learning methods, including the TDNN, the first shift-invariant (“convolutional”) neural net (1987), and developed many multimodal interaction systems. Waibel founded or co-founded more than 10 startups, including Jibbigo, the first speech translator on a phone (acquired by Facebook in 2013), and Kites, a provider of simultaneous translation services (acquired by Zoom in 2021). Waibel is a member of the National Academy of Sciences of Germany, a Fellow of the IEEE and of ISCA, and a Research Fellow at Zoom. He holds BS, MS, and PhD degrees from MIT and CMU.
