A multimodal speech and graphical interface for hands-free data capture and querying in MRO
Project overview
For manual workers in the manufacturing and construction sectors, and particularly those engaged in maintenance and inspection work, a key part of the job is often to record information about some aspect of the task they are engaged in, such as the serial number of a component they are servicing. Manual workers also need to use information in their work: for example, they need access to technical manuals or service records.
In current practice, methods for moving information between a worker engaged in a manual task and the company information system are hugely inefficient, often requiring the worker to down tools, remove protective clothing, move to a computer workstation, or scribble notes on paper to be entered later. Such practices disrupt the job, take time and are error-prone.
We developed new technology to provide manual workers with a natural, hands-free, mobile interface that allows a person to capture data and to retrieve data from enterprise information systems using a combination of their voice, a conversational ‘agent’, and a visual display on a screen such as a tablet or AR headset.
In this focussed project the research team at the University of Sheffield worked with a British multinational engineering company and the Advanced Manufacturing Research Centre (AMRC) at the University of Sheffield to develop a demonstrator that would allow engineers to enter various component identifiers by voice as part of their machinery maintenance, repair and overhaul (MRO) operations.
The company provided access to use cases, data and staff for discussion and system evaluation; AMRC provided an avenue for further development of the technology by introducing us to additional potential interested parties.
The specific aim of this project was to work with the company to develop and evaluate a system that demonstrated a hands-free, spoken language/graphical interface to support data capture and querying by an operative on the shop floor who was engaged in maintenance, repair and overhaul (MRO).
Our long-term aim (beyond the end of the project) is to progressively develop this novel interface technology into a commercial offering that addresses the industry-wide challenge of connecting manual workers to enterprise information systems.
The principal beneficiaries of the work are the company and the University of Sheffield (TUoS).
What was done?
The main activities undertaken were:
1. Requirements gathering/use case specification.
The company and the University of Sheffield staff met and identified a data capture scenario of potential benefit to the company, where previous attempts to employ voice-based data entry had failed at a critical step – the entry of multiple, complex alphanumeric strings, such as part and serial numbers.
An analysis and review of prior work on data entry by voice was carried out. This identified issues and challenges to delivering high accuracy, real-time data entry by voice and informed our interface design.
2. Data collection
Several types of data were collected. First, recordings were made of the background noise in an MRO facility. Second, the University of Sheffield staff developed software to produce random alphanumeric identifiers in the required form and display these to human subjects, who were then recorded reading these numbers aloud.
Using this approach, around 60 participants at Sheffield and 15 at the company read these identifiers for 15–30 minutes each, generating thousands of examples of read alphanumeric identifiers. These were subsequently manually transcribed and the audio files cleaned up, in order to provide training and testing data for our speech recogniser.
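The identifier-generation step described above can be illustrated with a minimal sketch. The identifier formats used by the company are not given in this report, so the two patterns below are purely hypothetical placeholders, as are the function names; the sketch only shows the general idea of producing random prompts that follow a fixed template.

```python
import random
import string

# Hypothetical templates (NOT the company's real formats):
# 'A' stands for an uppercase letter, '9' for a digit, anything else is literal.
HYPOTHETICAL_FORMATS = {
    "part_number": "AA9999-99",
    "serial_number": "A99999999",
}


def generate_identifier(pattern, rng):
    """Fill a template with random letters and digits."""
    chars = []
    for ch in pattern:
        if ch == "A":
            chars.append(rng.choice(string.ascii_uppercase))
        elif ch == "9":
            chars.append(rng.choice(string.digits))
        else:
            chars.append(ch)  # literal separator such as '-'
    return "".join(chars)


def make_prompts(kind, count, seed=None):
    """Return `count` random identifiers of the given kind to show to a reader."""
    rng = random.Random(seed)  # a fixed seed makes a prompt list reproducible
    pattern = HYPOTHETICAL_FORMATS[kind]
    return [generate_identifier(pattern, rng) for _ in range(count)]


if __name__ == "__main__":
    for prompt in make_prompts("part_number", 3, seed=0):
        print(prompt)
```

Storing each generated prompt alongside its recording gives a reference text against which the later manual transcription can be checked.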
3. Demonstrator development
A demonstrator was developed comprising three key components:
(a) A customised automatic speech recognition (ASR) engine, tailored to the acoustic environment of the shop floor and recognition of complex alphanumeric identifiers.
(b) A graphical interface that lets users turn a microphone on and off, see the numbers they have spoken, confirm whether or not they are correct, and correct them if wrong.
(c) An interaction manager that mediated between the user interface, the speech recognition engine and a database backend where the data entered was stored.
These three components were embedded in a client-server architecture allowing the interface to run on a lightweight mobile device (such as a phone or tablet) employed by the end user, while the ASR engine, interaction manager and database access components run on a remote server with greater compute capabilities.
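As a rough illustration of the mediating role of the interaction manager, the sketch below holds a recognition hypothesis until the user confirms or corrects it, and only then writes it to storage. It is not the project's implementation: the class and method names are hypothetical, the ASR engine is replaced by an injected function, and the database is replaced by an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class InteractionManager:
    # Stand-in for the ASR engine: takes audio bytes, returns recognised text.
    recognise: Callable[[bytes], str]
    # Stand-in for the database backend.
    store: List[str] = field(default_factory=list)
    # Hypothesis shown to the user and awaiting confirmation.
    pending: Optional[str] = None

    def handle_audio(self, audio: bytes) -> str:
        """Run recognition and hold the result until the user responds."""
        self.pending = self.recognise(audio)
        return self.pending

    def confirm(self) -> str:
        """User accepted the displayed identifier: save it."""
        if self.pending is None:
            raise ValueError("nothing to confirm")
        value, self.pending = self.pending, None
        self.store.append(value)
        return value

    def correct(self, corrected: str) -> str:
        """User edited the displayed identifier: save the edited value."""
        if self.pending is None:
            raise ValueError("nothing to correct")
        self.pending = corrected.strip().upper()
        return self.confirm()


if __name__ == "__main__":
    manager = InteractionManager(recognise=lambda audio: "AB1234-56")
    print(manager.handle_audio(b""))  # hypothesis displayed to the user
    print(manager.confirm())          # value written to storage
```

Keeping the recogniser behind a plain function boundary mirrors the client-server split described above: the interface only needs to send audio and receive text, wherever the recognition actually runs.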
4. Technology evaluation
Both off-line evaluations of the ASR component and interactive, user evaluations of the full system were carried out. The latter was planned to include end-user testing in situ at the company, but Covid unfortunately ruled this out.
We carried out a preliminary analysis of number-entry rates and reading errors based on the spoken-number recording data and the results of some simple tests involving researchers typing in numbers in similar conditions to the spoken recordings.
With the University of Sheffield team members as participants, a task-based evaluation was carried out in controlled conditions to compare user performance on a data-entry task using current practice (that is, typing) versus our voice demonstrator system.
We modified the number presentation interface so that a set of evaluation numbers could be presented to users on a phone, simulating the scenario where a user enters numbers present on a physical object.
We also developed a special-purpose evaluation system interface to support users carrying out the data-entry task in the two system conditions (typing versus voice). This interface included an automatic scoring system, where values entered were validated against a target set of numbers.
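The automatic scoring idea can be sketched as follows. This is an assumption-laden illustration rather than the evaluation system itself: the function name, the whitespace and case normalisation, and the reported measures are all hypothetical choices made for the example.

```python
def score_entries(targets, entries, elapsed_seconds):
    """Compare entered identifiers with their targets, position by position.

    Returns the number correct, the accuracy, and the entry rate per minute.
    """
    if len(targets) != len(entries):
        raise ValueError("targets and entries must have the same length")

    def normalise(text):
        # Assumed normalisation: ignore surrounding spaces and letter case.
        return text.strip().upper()

    correct = sum(
        normalise(target) == normalise(entry)
        for target, entry in zip(targets, entries)
    )
    total = len(targets)
    return {
        "correct": correct,
        "total": total,
        "accuracy": correct / total if total else 0.0,
        "entries_per_minute": (
            60.0 * total / elapsed_seconds if elapsed_seconds > 0 else 0.0
        ),
    }


if __name__ == "__main__":
    targets = ["AB1234-56", "CD9876-54", "EF0001-02"]
    typed = ["AB1234-56", "CD9876-45", "ef0001-02"]
    print(score_entries(targets, typed, elapsed_seconds=90))
```

Running the same function over the typing and voice conditions yields directly comparable accuracy and speed figures for each participant.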
The most significant scientific result of the project was a demonstration that the complex alphanumeric component identifiers can indeed be recognised by an ASR system to a sufficient level of accuracy to make the technology a viable solution for data capture on the shop floor. The project also showed that these numbers can be entered as quickly using spoken data entry as they can by typing, and with considerably less effort.
The project has also delivered core technology integrated within an extensible architecture that provides the basis for further development and impact, as discussed below.
The project has provided insights into how to support human error correction in a multi-modal interface. Additionally, the project has transferred technical know-how and understanding to employees at the company and AMRC as well as deepening the understanding of researchers at the University of Sheffield.
Finally, a valuable dataset has been developed that may be releasable to the wider research community at a later date, once approval has been gained from the company and once any commercial advantage this dataset could give to a nascent University of Sheffield enterprise has been fully exploited.
Software demonstrator. Includes a customised automatic speech recognition system, web-based graphical user interface, audio streaming component, interaction manager and lightweight database.
Evaluation system and interface. A version of the demonstrator that has been modified and extended to support an interactive, task-based evaluation by users. This includes: an interface for entering numbers by voice and an interface for entering numbers by typing; and in each an automatic scoring system where numbers entered are compared against a validation set.
A web-based software platform for collecting spoken alphanumeric identifiers. The software randomly generates identifiers in the different forms used by the company and displays them at slightly different angles and line-broken in different ways to simulate picking up an object and having to read a serial or part number from it. The numbers are presented to participants who read them aloud. The numbers themselves are recorded in a central datastore and are subsequently aligned with the audio files of participants speaking them.
A modified version of the number presentation software designed for use in a task-based evaluation. In this case, the user selects a type of number for entry; the interface then presents in turn each of a set of target numbers to the user, so they may enter the number into the evaluation system interface by the required modality (typing, voice).
A dataset of spoken alphanumeric identifiers. Assembled using the tool described in (2) above, from participants at the University of Sheffield and the company. Includes raw audio recordings of spoken identifiers, cleaned (end-pointed) versions of the spoken identifiers, transcribed versions of the spoken identifiers and textual versions of the identifiers that participants were asked to read.
A collection of videos demonstrating people using different versions of the system interface to enter data.
A collection of presentations given by the University of Sheffield to the company detailing progress of the project, and analysis and results obtained at various stages.
Please contact project lead, Professor Rob Gaizauskas, at the University of Sheffield, to discuss access to the demonstrator code, datasets or demo videos generated in the course of this project.
The project demonstrated to the company that voice-based data entry for MRO was feasible. They were considering how best to take this insight forward (it is a challenging time for the company): they would have liked to buy a product that would do this for them, but the University of Sheffield team was not in a position to deliver such on a commercial basis at this time.
As a consequence of the technical success of the project and our deepening understanding of the potential market opportunities for this technology, we have sought ways to take the technology forward towards commercialisation.
We have begun a collaboration with a Sheffield-based software development company which is interested in becoming involved with voice-based user interface work. With them we submitted an application to a KTN-iX call to deliver a voice-based query system for another multinational company, building on work we have carried out in the current project. We have been shortlisted for that opportunity and are awaiting a final decision.
One of the early career researchers on the project was sponsored by the University of Sheffield to attend the NorthXNorthwest “Lean Launch Programme”, a market discovery programme, to explore the commercial potential of our voice-based data entry technology.
As a consequence of contacts made during that programme, the University of Sheffield has identified two new sectors where there is considerable interest in voice-based data entry, namely the agricultural sector and the utility-servicing sector. We have entered into discussions with organisations in each of these sectors who expressed an interest in taking our technology forward.
In collaboration with the Sheffield software development company, we aim to seek funding support for projects in both these sectors, either directly from the companies themselves or through Innovate UK.
Independently, one of the venture capitalists who participated as a “dragon” in the final pitch session of the Lean Launch Programme has expressed an interest in talking to us about the future of our technology.
As indicated above, we plan to progress collaborations with agricultural and utility-service sector organisations to see whether funded projects can be secured to further develop the technology and take it closer to market. We also intend to submit an EPSRC proposal to explore some of the more basic research questions that work to date has thrown up.
We are engaged in on-going discussions with the University of Sheffield Commercialisation Services about how best to move towards setting up a spin-out. We plan to continue to engage with both the partner company and AMRC to nurture relationships and potential future joint activities.
We would hope that the company has learned that universities, and the University of Sheffield in particular, can deliver solutions to difficult technical problems, when the private sector cannot, and at relatively low cost. We have learned that working with a large, private sector manufacturer throws up interesting intellectual challenges and can be rewarding.
However, we have also learned that supplying a superior technical solution to a real problem does not in itself make a commercial opportunity – much more needs to be done to translate a proof-of-concept demonstrator into a commercially viable enterprise that can supply a sustainable solution to a large manufacturer.
With hindsight we would not have started a collaborative, knowledge transfer project at the outset of a global pandemic. While many activities could proceed remotely, many were slowed substantially, and some rendered impossible (for example, final user testing/evaluation was not possible due to Covid-constrained access to company facilities).
Throughout the project we kept believing this would change, but in hindsight it would have been better to have planned alternatives much earlier.
And of course, more funding would always have been useful. The multi-sub-disciplinary nature of this project, limited access to staff with the requisite skills and limited funding made it difficult to bring together the right people, with the right skills, at the right time.
We were forced to use multiple researchers part-time at different stages of the project due to their availability and commitments to other projects. This led to some inefficiencies in communication and handover that lengthened the project and may have led to sub-optimal solutions in places.
What has Pitch-In done for you?
This project has enabled speech and language technology workers at the University of Sheffield to validate their intuitions that voice-enabled data entry in noisy environments is not only possible but useful and sought after by real-world enterprises.
It has allowed us to build a prototype demonstrator and acquire the expertise to begin a journey towards a commercially successful spin-out enterprise, which will help to make multiple British industries more effective and efficient and further the Industry 4.0 vision.
Our industry partner has learned that a technical challenge which appeared insoluble with current technology is indeed soluble and has therefore put voice-based data capture from the shop floor back on the agenda as a technology worth investigating further.
Professor Rob Gaizauskas – the University of Sheffield
The University of Sheffield
A multinational engineering company