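Android development: transcribing audio from a JSON link and comparing it with Speech to Text
Here is a workable approach in three core steps: convert the audio URL to text, get the result from Android's native Speech to Text, and then compare the two texts.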
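1. Convert the audio URL to text
There are two common options; pick whichever fits your needs:
Option A: Use an online speech recognition API (for example Google Cloud Speech-to-Text)
This suits cases where recognition accuracy matters most and a network dependency is acceptable:
- Take the audio link from the JSON and download the audio file to local storage on the device. The app needs the INTERNET permission. Saving into the app-private directory (for example context.filesDir) needs no storage permission on any Android version; only shared storage requires READ_EXTERNAL_STORAGE/WRITE_EXTERNAL_STORAGE on Android 12 and below, or READ_MEDIA_AUDIO on 13+.
- Call the API with the local audio and read back the recognition result.
Key code example (Kotlin):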
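    import com.google.cloud.speech.v1.RecognitionAudio
    import com.google.cloud.speech.v1.RecognitionConfig
    import com.google.cloud.speech.v1.SpeechClient
    import com.google.protobuf.ByteString
    import java.io.File
    import java.net.URL

    // Run all of this off the main thread; Android throws NetworkOnMainThreadException otherwise.
    // 1. Download the audio into the app-private directory
    val audioUrl = "audio link parsed from the JSON"
    val tempAudioFile = File(context.filesDir, "temp_audio.wav")
    URL(audioUrl).openStream().use { inputStream ->
        tempAudioFile.outputStream().use { outputStream ->
            inputStream.copyTo(outputStream)
        }
    }

    // 2. Call the Google Cloud Speech-to-Text API (enable the API in the console and configure credentials first).
    //    The synchronous recognize() call is intended for short clips.
    val audioTranscript = SpeechClient.create().use { speechClient ->
        // RecognitionAudio has no fromFile(); pass the file bytes as content
        val audio = RecognitionAudio.newBuilder()
            .setContent(ByteString.copyFrom(tempAudioFile.readBytes()))
            .build()
        val config = RecognitionConfig.newBuilder()
            .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
            .setSampleRateHertz(16000) // must match the actual file
            .setLanguageCode("zh-CN") // adjust to the language of your audio
            .build()
        val response = speechClient.recognize(config, audio)
        // Each result covers one consecutive portion of the audio, so join the top alternatives
        response.resultsList.joinToString("") { it.alternativesList.firstOrNull()?.transcript.orEmpty() }
    }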
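The download step above uses URL.openStream(), which has no timeouts. A minimal sketch of a download helper with connect and read timeouts is shown here; the function name downloadTo and the 15-second default are assumptions for illustration, not part of the original answer. On Android it would still need to run on a background thread.

```kotlin
import java.io.File
import java.net.URI

/**
 * Copies the resource at [url] into [dest] and returns the number of bytes written.
 * Hypothetical helper: the name and the default timeout are illustrative only.
 */
fun downloadTo(url: String, dest: File, timeoutMs: Int = 15_000): Long {
    val connection = URI.create(url).toURL().openConnection().apply {
        connectTimeout = timeoutMs // give up if the server cannot be reached
        readTimeout = timeoutMs    // give up if the transfer stalls
    }
    return connection.getInputStream().use { input ->
        dest.outputStream().use { output ->
            input.copyTo(output)
        }
    }
}
```

A call such as downloadTo(audioUrl, File(context.filesDir, "temp_audio.wav")) would replace the openStream() block; an IOException signals a failed or timed-out download.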
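Option B: Use an offline open-source library (for example Vosk)
This suits offline scenarios and does not depend on a third-party API:
- Download the Vosk Android SDK and the model for your language (for example the Chinese model, placed in the assets directory)
- Convert the audio to the format Vosk expects (16 kHz, mono, 16-bit PCM WAV)
- Load the model and recognize the audio
Key code example: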
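    import android.util.Log
    import org.json.JSONObject
    import org.vosk.Recognizer
    import org.vosk.android.StorageService
    import java.io.FileInputStream

    // Unpack the model from assets/model to internal storage, then recognize the file.
    // The callbacks are delivered on the main thread, so move long recognitions to a background thread.
    StorageService.unpack(context, "model", "model",
        { model ->
            val recognizer = Recognizer(model, 16000.0f)
            // javax.sound.sampled (AudioFormat/AudioInputStream) is not available on Android,
            // so read the PCM data directly from the file.
            FileInputStream(tempAudioFile).use { input ->
                input.skip(44) // skip the canonical 44-byte WAV header (assumes no extra chunks)
                val buffer = ByteArray(4096)
                var bytesRead: Int
                while (input.read(buffer).also { bytesRead = it } != -1) {
                    recognizer.acceptWaveForm(buffer, bytesRead)
                }
            }
            // finalResult is a JSON string; the transcript is in its "text" field
            val audioTranscript = JSONObject(recognizer.finalResult).optString("text")
            recognizer.close()
        },
        { e -> Log.e("Vosk", "Failed to unpack the model", e) }
    )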
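Because Vosk expects 16 kHz mono 16-bit PCM, it can help to verify the downloaded file before feeding it to the recognizer. The following is a minimal sketch that reads the canonical 44-byte WAV header; WavInfo, parseWavHeader, readWavInfo, and isVoskReady are hypothetical names, and the parser assumes the fmt chunk comes first, which is not guaranteed for every WAV file.

```kotlin
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder

/** Fields read from a canonical 44-byte PCM WAV header (hypothetical helper type). */
data class WavInfo(val audioFormat: Int, val channels: Int, val sampleRate: Int, val bitsPerSample: Int)

/** Returns null unless the bytes start with a RIFF/WAVE header whose first chunk is "fmt ". */
fun parseWavHeader(header: ByteArray): WavInfo? {
    if (header.size < 44) return null
    val riff = String(header, 0, 4, Charsets.US_ASCII)
    val wave = String(header, 8, 4, Charsets.US_ASCII)
    val fmt = String(header, 12, 4, Charsets.US_ASCII)
    if (riff != "RIFF" || wave != "WAVE" || fmt != "fmt ") return null
    val buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN)
    return WavInfo(
        audioFormat = buf.getShort(20).toInt(),   // 1 means uncompressed PCM
        channels = buf.getShort(22).toInt(),
        sampleRate = buf.getInt(24),
        bitsPerSample = buf.getShort(34).toInt()
    )
}

/** Reads the first 44 bytes of [file] and parses them; null if the file is too short. */
fun readWavInfo(file: File): WavInfo? = file.inputStream().use { input ->
    val header = ByteArray(44)
    var read = 0
    while (read < header.size) {
        val n = input.read(header, read, header.size - read)
        if (n < 0) break
        read += n
    }
    if (read == header.size) parseWavHeader(header) else null
}

/** True when the audio is mono 16-bit PCM at the expected sample rate. */
fun isVoskReady(info: WavInfo?, expectedRate: Int = 16000): Boolean =
    info != null && info.audioFormat == 1 && info.channels == 1 &&
        info.sampleRate == expectedRate && info.bitsPerSample == 16
```

If isVoskReady(readWavInfo(tempAudioFile)) is false, the file would need to be converted or resampled first; that conversion is outside the scope of this sketch.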
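2. Get the result from Android's native Speech to Text
Android's built-in SpeechRecognizer covers this by calling the system speech recognition service directly. Note that it recognizes live microphone input rather than an audio file:
Code example: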
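    import android.content.Intent
    import android.os.Bundle
    import android.speech.RecognitionListener
    import android.speech.RecognizerIntent
    import android.speech.SpeechRecognizer

    // audioTranscript is the text obtained in step 1; it is passed in so the callback can compare against it.
    private fun getNativeSttResult(audioTranscript: String?) {
        // SpeechRecognizer must be created and used on the main thread
        val speechRecognizer = SpeechRecognizer.createSpeechRecognizer(context)
        val recognitionListener = object : RecognitionListener {
            override fun onResults(results: Bundle?) {
                val matches = results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                val nativeSttResult = matches?.firstOrNull()
                // Compare as soon as the result arrives
                compareText(audioTranscript, nativeSttResult)
                speechRecognizer.destroy() // release the service connection
            }

            override fun onError(error: Int) {
                // Handle recognition errors, for example tell the user "recognition failed"
                speechRecognizer.destroy()
            }

            // The remaining callbacks must be implemented as well
            override fun onReadyForSpeech(params: Bundle?) {}
            override fun onBeginningOfSpeech() {}
            override fun onRmsChanged(rmsdB: Float) {}
            override fun onBufferReceived(buffer: ByteArray?) {}
            override fun onEndOfSpeech() {}
            override fun onPartialResults(partialResults: Bundle?) {}
            override fun onEvent(eventType: Int, params: Bundle?) {}
        }
        speechRecognizer.setRecognitionListener(recognitionListener)

        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
            putExtra(RecognizerIntent.EXTRA_LANGUAGE, "zh-CN") // expects a language tag string, not a Locale
        }
        speechRecognizer.startListening(intent)
    }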
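Note: declare the android.permission.RECORD_AUDIO permission and request it at runtime. Apps targeting Android 11 (API 30) or higher also need a <queries> entry for the android.speech.RecognitionService intent action in the manifest so the recognition service is visible to the app.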
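The onResults bundle can carry several candidate transcripts, and it may also include a float array under SpeechRecognizer.CONFIDENCE_SCORES. Taking the first candidate is usually fine, but a minimal sketch of choosing by confidence is shown below. pickBestCandidate is a hypothetical helper, and it assumes that a negative score means "unavailable", in which case it falls back to the first candidate.

```kotlin
/**
 * Picks the candidate with the highest confidence score.
 * Falls back to the first candidate when scores are missing, mismatched in size,
 * or all negative (treated here as "confidence unavailable").
 */
fun pickBestCandidate(candidates: List<String>?, scores: FloatArray?): String? {
    if (candidates.isNullOrEmpty()) return null
    if (scores == null || scores.size != candidates.size) return candidates.first()
    val best = scores.indices.maxByOrNull { scores[it] } ?: 0
    return if (scores[best] < 0f) candidates.first() else candidates[best]
}
```

Inside onResults this could be called as pickBestCandidate(matches, results?.getFloatArray(SpeechRecognizer.CONFIDENCE_SCORES)).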
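3. Compare the text results
Checking with equals() directly is too strict, because the recognition results can differ in case, punctuation, and colloquial wording. I recommend computing similarity with **edit distance (Levenshtein Distance)**, or preprocessing the text before comparing:
Preprocessing + similarity calculation code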
    import android.util.Log

    // Text preprocessing: lowercase, then keep only ASCII letters, digits, and CJK characters
    // (this removes punctuation and all whitespace)
    fun preprocessText(text: String?): String {
        return text?.lowercase()?.replace(Regex("[^a-zA-Z0-9\\u4e00-\\u9fa5]"), "") ?: ""
    }

    // Compute the Levenshtein edit distance
    fun calculateLevenshteinDistance(s1: String, s2: String): Int {
        val dp = Array(s1.length + 1) { IntArray(s2.length + 1) }
        for (i in 0..s1.length) dp[i][0] = i
        for (j in 0..s2.length) dp[0][j] = j
        for (i in 1..s1.length) {
            for (j in 1..s2.length) {
                val cost = if (s1[i - 1] == s2[j - 1]) 0 else 1
                dp[i][j] = minOf(
                    dp[i - 1][j] + 1,       // deletion
                    dp[i][j - 1] + 1,       // insertion
                    dp[i - 1][j - 1] + cost // substitution
                )
            }
        }
        return dp[s1.length][s2.length]
    }

    // Compute similarity as a percentage
    fun calculateSimilarity(s1: String, s2: String): Double {
        val processedS1 = preprocessText(s1)
        val processedS2 = preprocessText(s2)
        val maxLength = maxOf(processedS1.length, processedS2.length)
        if (maxLength == 0) return 100.0 // two empty strings count as identical
        val distance = calculateLevenshteinDistance(processedS1, processedS2)
        return ((maxLength - distance).toDouble() / maxLength) * 100
    }

    // Comparison entry point
    fun compareText(audioTranscript: String?, nativeSttResult: String?) {
        val similarity = calculateSimilarity(audioTranscript ?: "", nativeSttResult ?: "")
        Log.d("TextComparison", "Similarity of the two texts: ${String.format("%.2f", similarity)}%")
    }
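As a variation on the comparison code, here is a self-contained sketch that normalizes with Char.isLetterOrDigit() (so it is not limited to ASCII and the CJK range) and computes the edit distance with two rolling rows instead of a full matrix. The names normalize, levenshtein, similarityPercent, and isSameUtterance, as well as the 85% threshold, are assumptions for illustration; a suitable threshold depends on your audio and should be tuned on real samples.

```kotlin
/** Keeps only letters and digits of any script and lowercases them; drops spaces and punctuation. */
fun normalize(text: String?): String =
    text.orEmpty().filter { it.isLetterOrDigit() }.lowercase()

/** Levenshtein distance using two rolling rows, so memory is proportional to b.length. */
fun levenshtein(a: String, b: String): Int {
    var prev = IntArray(b.length + 1) { it } // distances from the empty prefix of a
    var curr = IntArray(b.length + 1)
    for (i in 1..a.length) {
        curr[0] = i
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            curr[j] = minOf(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        }
        val tmp = prev
        prev = curr
        curr = tmp
    }
    return prev[b.length]
}

/** Similarity in percent after normalization; two empty strings count as identical. */
fun similarityPercent(s1: String?, s2: String?): Double {
    val a = normalize(s1)
    val b = normalize(s2)
    val maxLength = maxOf(a.length, b.length)
    if (maxLength == 0) return 100.0
    return (maxLength - levenshtein(a, b)).toDouble() / maxLength * 100
}

/** Hypothetical decision rule: treat the texts as the same utterance above a threshold. */
fun isSameUtterance(s1: String?, s2: String?, thresholdPercent: Double = 85.0): Boolean =
    similarityPercent(s1, s2) >= thresholdPercent
```

For example, "今天天气很好。" and "今天 天气 很好" normalize to the same string, so they compare as identical even though equals() would report a mismatch.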
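With this workflow in place, you cover every stage from converting the audio URL to text through comparing it with the Android native STT result.
The question comes from Stack Exchange; it was asked by uma.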




