
Note: I am not super knowledgeable about web JavaScript or streaming, but I have read this, this, and this. I am proposing an alternate idea and just trying to verify whether I have a sound theoretical starting point to build on.

Problem: You have an arbitrary number of people in the same place -- for example, 50 people on a bike ride -- and they all have phones and various brands of Bluetooth speakers.

Wouldn't it be nice if there were a way to have everyone "tune in" to a single radio station and play the same music exactly in sync, like in the analog old days?

Well, Spotify has "Jams," but guess what: as far as I know, there can only be a single speaker putting out the audio. And even if there were a way to involve additional speakers, there is no way to ensure they are in precise sync.

Also, speaker brands like JBL have "speaker pairing," whereby speakers of the same brand can sync up... but the problem is that everyone would have to have the same brand of speaker. Boo.

There are Bluetooth multi-stream transmitters, but these are limited to two client devices.

There is Bluetooth 5.0, but everyone and everything would need to be using it, and it still doesn't really accommodate many clients, or keep them synced.

There are apps like AmpMe that are designed for this exact problem, but... I tried AmpMe and the first thing it says is "get 3 days free or pay 9.99 a week." It is also very unclear what it is doing, how it is communicating (or attempting to communicate) with other devices, or what I am supposed to do to use it. I also don't think anyone is going to install, pay for, and learn an app just in case they happen to go on a bike ride with 50 other people someday, and/or have the patience to get everything synced up.

There are other attempted solutions like this and this, but they approach things differently and are inconclusive... and frankly, in the words of Trick Daddy, "Way too advanced for this."

But if there were a simple URL that people could go to to join a "stream," and it had a way of manually micro-adjusting the playhead position forward or backward, that might be easy enough for everyone.

Let's say you have a URL, myurl.com, that works like this:

  • Assumptions:

    • All client devices are cellphones using a browser.
    • Any arbitrarily chosen group of modern cellphones can be assumed to have local datetimes within +/- 100ms of one another (an amount recoverable with manual playhead-position shifting).
    • There is a way to schedule raw audio buffers against the system clock using JavaScript in a web browser (and to do so contiguously, without audio artifacts) -- that is, scheduling raw audio buffers similarly to APIs such as AVAudioPlayerNode on iOS or AudioTrack on Android, and NOT by using high-level playback functions.
    • We are not "streaming" in the traditional sense.
  • When you load the URL to start listening:

  • Audio pre-exists on the server as files -- chunks (perhaps one-second chunks) named by the timestamp at which they should start playing on the client, in reference to the client's datetime setting.

  • When a client connects, it sends its local time to the server. (There is no server-to-client time sync... this is just to make sure the client's clock is properly set and within a manually correctable margin of error.)

  • The client calculates the next file to request (or set of files to request asynchronously), such as [timestamp].wav, where timestamp is a time in the future that falls on a pre-determined grid -- an implicitly shared timeline grid. (See the sketch after this list.)

  • The client receives each chunk soon enough to schedule and play it exactly at its stated time, in reference to the client's system clock.

  • There are manual skew/time-correction buttons on the webpage, like +/- 1, 5, 10, or 100ms. This means that users would literally listen to and correct the sync of their device against the rest of them (which are assumed to be in sync already).

  • If a client "drops a frame" (i.e., it doesn't download a chunk soon enough to play it on time), then that person's audio would just drop out for that segment; but as long as they had already set their manual sync offset, their audio would still be guaranteed to be in sync if and when it resumes -- the dropout might not even be noticeable.
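
To make the grid arithmetic concrete, here is a rough sketch of the client fetch loop I have in mind. The /chunks/[timestamp].wav endpoint, the chunk size, and the lead time are all illustrative (and I know an AudioContext can only start after a user gesture, so this would run from a "Join" button handler); scheduleChunk is left abstract and would be implemented by approach A or B below.

```javascript
const CHUNK_MS = 1000; // grid size: one-second chunks
const LEAD_MS = 3000;  // fetch this far ahead; would need tuning for network jitter

const ctx = new AudioContext();

// First grid slot far enough in the future to download and decode in time.
let nextTarget = Math.ceil((Date.now() + LEAD_MS) / CHUNK_MS) * CHUNK_MS;

async function pump() {
  while (true) {
    const target = nextTarget;
    nextTarget += CHUNK_MS;
    try {
      // Chunks are named by the epoch-ms timestamp at which they must
      // start playing, e.g. /chunks/1700000000000.wav.
      const resp = await fetch(`/chunks/${target}.wav`);
      const buf = await ctx.decodeAudioData(await resp.arrayBuffer());
      scheduleChunk(target, buf); // approach A or B, below
    } catch (e) {
      // Missed or failed chunk: stay silent for this slot; sync is preserved.
    }
    // Sleep until it is time to fetch the next chunk.
    const sleepMs = nextTarget - LEAD_MS - Date.now();
    if (sleepMs > 0) await new Promise((r) => setTimeout(r, sleepMs));
  }
}
```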

It seems there are two basic ways to approach this:

A) Use a custom AudioWorkletProcessor, establish as precise a "t0" as possible that aligns with the time grid, and manually feed consecutive buffers directly to the output device. The trick to this would be accurately predicting how far in advance to set the target initial start time (the timestamp of the first chunk to download) and filling silence up to that point... and making sure there isn't a buffer underrun, like... ever.
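
A minimal sketch of what I imagine for (A), assuming the main thread decodes each chunk and posts it to the worklet as a Float32Array tagged with the absolute sample index (relative to the grid-aligned t0) at which it must begin -- the names are mine, not an established API beyond the Web Audio worklet interfaces:

```javascript
// worklet-player.js
class GridPlayer extends AudioWorkletProcessor {
  constructor() {
    super();
    this.queue = [];  // { startSample, samples: Float32Array }, in grid order
    this.cursor = 0;  // samples emitted since t0
    this.port.onmessage = (e) => this.queue.push(e.data);
  }
  process(inputs, outputs) {
    const out = outputs[0][0]; // mono; the render quantum is 128 frames
    for (let i = 0; i < out.length; i++, this.cursor++) {
      // Discard chunks that are fully played (or arrived too late to play).
      while (
        this.queue.length &&
        this.cursor - this.queue[0].startSample >= this.queue[0].samples.length
      ) {
        this.queue.shift();
      }
      const head = this.queue[0];
      const off = head ? this.cursor - head.startSample : -1;
      // Play the head chunk's sample if it covers this position; otherwise
      // silence (covers both the lead-in before t0 and dropped chunks).
      out[i] = off >= 0 ? head.samples[off] : 0;
    }
    return true; // keep alive: underrun becomes silence, not a stall
  }
}
registerProcessor('grid-player', GridPlayer);
```

On the main thread this would be loaded with ctx.audioWorklet.addModule('worklet-player.js'), wrapped in an AudioWorkletNode connected to ctx.destination, and fed via node.port.postMessage({ startSample, samples: buf.getChannelData(0) }).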

B) Use AudioBufferSourceNodes and schedule them to play at their respective timestamps. This is much nicer in theory and in simplicity, but the problem is that there is no super-precise scheduling that isn't prone to shifts due to blocking that is out of our control.
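
A sketch of (B), implementing the scheduleChunk left abstract in the fetch loop above; skewMs is the manual offset from the buttons:

```javascript
let skewMs = 0; // adjusted by the +/- buttons

function scheduleChunk(targetEpochMs, audioBuffer) {
  // Map the wall-clock target onto the AudioContext timeline. This mapping
  // is sampled on the main thread, so it can shift if the tab is busy;
  // re-deriving it per chunk (or via ctx.getOutputTimestamp()) limits drift.
  const when = ctx.currentTime + (targetEpochMs + skewMs - Date.now()) / 1000;
  if (when <= ctx.currentTime) return; // arrived too late: drop the chunk
  const src = ctx.createBufferSource();
  src.buffer = audioBuffer;
  src.connect(ctx.destination);
  src.start(when); // sample-accurate on the context clock once submitted
}
```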

The manual skew would correct individual variances between clients' system datetimes, and Bluetooth and other hardware latencies. The skew would work differently depending on the implementation (A or B) above. If using an AudioWorkletProcessor, we would have precise control and could literally shave or add samples to the buffer... which is 1/44100th-of-a-second accuracy. If using AudioBufferSourceNode, we would just adjust the scheduling, which seems flaky to begin with.
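
For (A), the skew could look something like this fragment, replacing the simple onmessage handler in the GridPlayer sketch above (sampleRate is a global in the worklet scope; at 44.1 kHz, 1 ms is about 44 samples). A real implementation would spread the jump over time to avoid an audible click:

```javascript
// Inside GridPlayer's constructor:
this.port.onmessage = (e) => {
  if (e.data.type === 'skew') {
    // Positive ms plays this client later, so the cursor jumps backwards.
    this.cursor -= Math.round((e.data.ms * sampleRate) / 1000);
  } else {
    this.queue.push(e.data); // a decoded chunk, as before
  }
};
```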

There could be a mechanism by which the Nth person would not begin receiving audio until the (N-1)th person had established sync (i.e., had indicated that they had done so).

Remember:

  • There is no need for client-server time synchronization like NTP, PTP, or RTC, because we are relying on the filenames of the file chunks and the clients' system datetime settings to establish a baseline coarse sync.
  • The source of the audio "stream" isn't really relevant to the problem -- as long as it is scheduled in advance. It could be a deterministic 24-hour loop, or it could be a "live" (but delayed) stream.

Is this at least feasible in theory as a starting point?

3 Answers


You have made the assumption that:

modern cellphones can be assumed to have local datetimes within +/- 100ms of one another

I am not sure that is true; however, if it is not, then no matter what solution you come up with, you need a "group synchronized time", which will be some offset from the local time on each phone.

If/once you have time synchronization, syncing the audio becomes trivial; all you need to do is:

  • Download/cache at least 1 song in advance.
  • Send a "start playback" message a short time (say 10 seconds) before the song needs to start playing - the interval should be long enough to account for any network latency.

Then, when that time (adjusted for the local offset) arrives, you just start playing the song.

Frankly, as soon as one song starts playing, the server can start sending messages to queue up the next song. Alternatively, I think browsers can play back songs starting at an offset, so you could also support joining midway through a song if you send "playback at offset" messages from the server.
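
For illustration, a rough sketch of a client-side handler for such a message, assuming the song was already downloaded and decoded into decodedSong, and that groupTimeOffsetMs is this phone's offset from the agreed group time (zero if you trust the local clocks); the names are mine:

```javascript
function onStartMessage({ startAtEpochMs, songOffsetSec }) {
  const localStartMs = startAtEpochMs - groupTimeOffsetMs; // group -> local time
  const when = ctx.currentTime + (localStartMs - Date.now()) / 1000;
  const src = ctx.createBufferSource();
  src.buffer = decodedSong; // downloaded and decoded in advance
  src.connect(ctx.destination);
  // The second argument starts playback mid-song, which also supports
  // late joiners via a "playback at offset" message.
  src.start(Math.max(when, ctx.currentTime), songOffsetSec);
}
```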

If the list of songs is known in advance you could also pre-download them on a fast connection, so you don't need to worry about slow connections during the ride.


The second link in the OP identifies an additional problem: two devices don't play the same audio back at exactly the same time -- more specifically, by the end of a song one device is slightly ahead of the others.

With a custom player (app) that constantly syncs to the phone's real clock, you could probably adjust the playback to ensure it is the same on all phones.

However, I don't see how chunking the song helps you; either:

  • You reconstruct the chunks on the phone, so you are effectively playing the whole song.
  • Or you are trying to schedule the playback of each chunk using the browser API -- I doubt you will be able to achieve smooth playback doing that.

TL;DR - Accounting for this playback problem, I withdraw my "trivial" comment about playback - I don't think you can solve this without an app to ensure precise playback speed.

DavidT

Any arbitrarily chosen group of modern cellphones can be assumed to have local datetimes within +/- 100ms of one another (an amount recoverable with manual playhead-position shifting).

This feels like something you'd want to test. Phones could have really accurate time sync - that's how CSMA/CA TDM works, after all, and they all have GPS. But I'm not sure if they actually do.

Audio exists on the server as files -- chunks (perhaps one-second chunks) named by the timestamp at which they should start playing on the client, in reference to the client's datetime setting.

This is roughly how HLS works. Normally used as a video format with audio, but you can entirely ditch the video and just have an audio channel.
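
For example, an audio-only HLS media playlist might look like the sketch below; note that EXT-X-PROGRAM-DATE-TIME even maps a segment to wall-clock time, which is essentially your filename-timestamp idea (segment names here are illustrative):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:1
#EXT-X-MEDIA-SEQUENCE:1700000000
#EXT-X-PROGRAM-DATE-TIME:2023-11-14T22:13:20.000Z
#EXTINF:1.0,
1700000000000.aac
#EXTINF:1.0,
1700000001000.aac
```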

Audio is uncompressed (there is a direct correlation between file size and duration)

This is going to be very bad -- you're already banging up against bandwidth issues, and having a whole bunch of people on the same local network or cell downloading uncompressed audio will make that worse. You just need an audio format with fixed-duration (not fixed-size!) packets and proper timestamps; this is entirely feasible with MP3 or AAC.

If the bandwidth will not accommodate the data rate required by the audio format, then the client is rejected.

This is very difficult to determine other than by looking for packet loss, because you don't know how busy the cell/hotspot is.

There are manual skew/time-correction buttons on the webpage, like +/- 1, 5, 10, or 100ms. This means that users would literally listen to and correct the sync of their device against the rest of them (which are assumed to be in sync already).

I think users are likely to be bad at this; I don't think I could reliably determine whether my device was ahead or behind, let alone by how much, and it's fiddly. Also, sound takes about 3ms to travel 1m; I don't know how much of an issue that will be.

It would benefit from any sort of local sync mechanism, whether that's BT beacons or local wifi or even playing a calibration jingle with modulated ID information on it and listening for the delay.

I wonder whether phones let you do IP-level broadcast or multicast at all? "Wi-Fi Direct" exists but isn't quite the same thing.

pjc50

If all 50 people are going to use headphones, then the "current time" as interpreted by each phone will be enough for all of them to "play together" (because small differences in sync won't matter).

On the other hand, if all 50 people will be using speakers, then you run into sound-interference issues. I don't have the experience of many radios tuned to the same station at the same time, but I do have the experience of listening to announcements in train stations, bus stations, and airports... and the audio is unintelligible unless you're standing right beside a speaker.

Have you considered sound interference issues?


I second others' comments:

  • chopping media into files is done in HLS, might want to look into it;
  • do use compressed audio;
  • non-audio experts will find it difficult to decide which sound source is ahead or behind and by how much.

If you want people to manually fine-tune the time-shifting of the stream, one possible solution is to have another link for doing this tuning, which sends just beeps. With beeps it's much, much easier to recognize which one happened before or after. Maybe clients could get different tones (pitches) to make it easier to recognize which one is yours. (See the sketch below.)

Id est, usability issues.
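
A possible sketch of such a beep page, reusing the ctx and skewMs names from the question's sketches; the per-client pitch would come from the server (hypothetical):

```javascript
function beepOnGrid(myPitchHz) {
  const nextSec = Math.ceil(Date.now() / 1000) * 1000; // next one-second slot
  const when = ctx.currentTime + (nextSec + skewMs - Date.now()) / 1000;
  const osc = ctx.createOscillator();
  osc.frequency.value = myPitchHz; // a distinct pitch per client
  osc.connect(ctx.destination);
  osc.start(when);
  osc.stop(when + 0.05); // a 50 ms blip is easy to compare by ear
}
setInterval(() => beepOnGrid(660), 1000); // e.g. this client beeps at 660 Hz
```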

Pablo H