Class C downlinks failing from changing rx2 data rate

Thanks for the debugging and help. I believe I was able to manually trigger this effect by changing the RX2DR in TTS for the end device from 12 to 10 (though, to clarify, this issue always arose on its own without my intervention). Confirmed downlinks were previously working for the device, but after I did this I am getting the same behavior: unconfirmed downlink = scheduled, confirmed downlink = not scheduled.

Something else that also seems to happen (I'm not 100% sure this is linked to the same issue…) is that if I restart one of the affected devices, it will actually receive the most recent confirmed downlink I tried to send to it. So it's as if TTS is storing these messages somewhere, and only upon restart does it think it can send them. Sometimes I restart the devices well after the issue has happened (hours later) and they still receive the downlink.

Do these observations work with any theories you might have?

When debugging some unintentional confirmed downlinks for a trigger-happy client, it became apparent that downlinks move into a holding area for processing. The only solution to that particular downlink loop was to use the CLI to completely reset the MAC state.

It sounds like something similar is going on here.

The closest guess I have is the dutycycle.

When my confirmed downlinks weren’t scheduled at all (likely due to dutycycle limitations), I observed the same behaviour that a reboot would result in the Class C downlink being scheduled as a Class A downlink right after the rejoin. It looks like a downlink that cannot be scheduled when requested goes into some sort of limbo until some action occurs.

You may need to reduce your downlink interval and/or add more gateways if dutycycle is indeed limiting your application.

I had reduced my RX2DR from 12 to 10, and noticed it took a little longer for this issue to occur (~3 days rather than 1-2), though that might just be a coincidence. I’ve also tried turning off “Enforce Duty Cycle” in the settings of the gateway that is servicing them, but that did not resolve the issue either.

Regarding duty cycle, I'm not sure how this is calculated (presumably on some rolling average or regular interval?), but wouldn’t a device be able to “recover” from a state where it has “used up” all its duty cycle after a certain period of time? My issue persists once the device falls into this state, and it seems unrecoverable.
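
To make the “recovery” intuition concrete, here is a minimal sketch of a rolling-window airtime budget. The 1% limit and the one-hour window are illustrative assumptions only (regional rules and the way The Things Stack actually accounts for airtime may differ), and `DutyCycleWindow` is a hypothetical helper, not part of any TTS API.

```python
from collections import deque


class DutyCycleWindow:
    """Toy rolling-window airtime budget (hypothetical; not TTS's accounting)."""

    def __init__(self, limit=0.01, window_s=3600.0):
        self.limit = limit        # assumed 1 % duty cycle
        self.window_s = window_s  # assumed one-hour rolling window
        self._tx = deque()        # (start_time_s, airtime_s) of past transmissions

    def _expire(self, now):
        # Drop transmissions that have fallen out of the rolling window.
        while self._tx and self._tx[0][0] <= now - self.window_s:
            self._tx.popleft()

    def used(self, now):
        """Airtime in seconds still counted against the budget at time `now`."""
        self._expire(now)
        return sum(airtime for _, airtime in self._tx)

    def try_transmit(self, now, airtime_s):
        """Record the transmission and return True if it fits in the budget."""
        budget = self.limit * self.window_s
        if self.used(now) + airtime_s > budget:
            return False
        self._tx.append((now, airtime_s))
        return True


dc = DutyCycleWindow()
first = dc.try_transmit(now=0.0, airtime_s=35.0)     # nearly exhausts the 36 s budget
blocked = dc.try_transmit(now=10.0, airtime_s=2.0)   # does not fit any more
recovered = dc.try_transmit(now=3601.0, airtime_s=2.0)  # old airtime has expired
print(first, blocked, recovered)
```

Under these assumptions a budget that has been used up becomes available again once the old transmissions leave the window, so a pure duty-cycle limit would not by itself explain a state that never recovers.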

My closest observation is that if a downlink fails to be sent at the moment you request it, it may be rescheduled (see the observations above about consecutive confirmed downlinks), or otherwise it goes into limbo until an uplink occurs, after which it is transmitted as a Class A downlink (likely during Rx1, which explains the initial observations in your first post).

From my own testing, sending an unconfirmed downlink always works as long as your gateway has available dutycycle (on that point I share your expectations, although I am not sure myself). If you have a TTS demo instance with Network Operation Center, you could inspect the gateway dutycycle and test a bit more.

But I’d like to request @johan if it is feasible to implement some sort of verbose feedback when scheduling a downlink. Currently, the console appears to suggest it will be transmitted even if it won’t. It would be very useful if there is any feedback like “no dutycycle, postponing until device uplink” or “delaying until previous downlink is confirmed or timeout expires”.

@comori one detail: do you use “replace the downlink queue” or “add to the queue”?

There is indeed a queue in the Network Server. If the Network Server cannot schedule downlink for whatever reason (timing, conflict, duty-cycle), it will stay in the queue. That queue is used for Class A and B/C downlink. Some scheduling errors will fail the downlink (e.g. payload too big for the data rate), because those can and should be recovered by the application.
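
The queue behavior described above can be sketched as a toy model: transient scheduling errors leave the downlink in the queue, while errors the application should recover from fail it. All names here (`DownlinkQueue`, the error strings) are hypothetical and chosen for illustration; this is not the Network Server's actual implementation.

```python
from collections import deque

# Hypothetical error labels, named after the reasons given above.
TRANSIENT = {"timing", "conflict", "duty-cycle"}  # downlink stays queued
PERMANENT = {"payload-too-big"}                   # downlink fails; application must recover


class DownlinkQueue:
    def __init__(self):
        self.items = deque()
        self.failed = []

    def push(self, downlink):
        """Add to the end of the queue ("add to the queue")."""
        self.items.append(downlink)

    def replace(self, downlinks):
        """Discard what is queued and start over ("replace the downlink queue")."""
        self.items = deque(downlinks)

    def attempt(self, schedule):
        """Try to schedule the head of the queue.

        `schedule` is a callable returning None on success or an error string.
        """
        if not self.items:
            return "empty"
        error = schedule(self.items[0])
        if error is None:
            self.items.popleft()
            return "scheduled"
        if error in PERMANENT:
            self.failed.append(self.items.popleft())
            return "failed"
        return "queued"  # transient error: keep it for a later attempt


q = DownlinkQueue()
q.push({"f_port": 1, "confirmed": True})
r1 = q.attempt(lambda d: "duty-cycle")  # cannot be scheduled right now
r2 = q.attempt(lambda d: None)          # a later attempt succeeds
print(r1, r2, len(q.items))
```

In this model a downlink that hits a transient error is not lost; it simply waits for the next scheduling opportunity, which matches the observation that a queued downlink can show up much later.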

The Network Server does not check whether the end-device has available duty-cycle for the next ClassCTimeout time to send the acknowledgment.

The Gateway Server does, by default, enforce duty-cycle for the gateway. You can change this setting if you have access to the gateway. This requires the gateway to reconnect. The Gateway Server will disconnect the gateway within 5 minutes after this and other settings that affect the connection state change. With UDP, there is no connection to disconnect, so this may take a bit longer. Regardless of the gateway protocol, you will see disconnect and connect events for the gateway after changing the duty-cycle enforcement. You should only disable duty-cycle temporarily for testing purposes.

The Network Server does not want multiple outstanding downlinks. It seems indeed that The Things Stack V3 will not send a confirmed downlink when there is any outstanding downlink (confirmed or not) within the ClassCTimeout window after the previous downlink. I think that this is technically because some unconfirmed Class C downlink may require uplink, e.g. when the Class C downlink carries MAC commands. The Things Stack does however not send MAC commands in Class C downlink.
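
As a rough illustration of the rule as described here (not taken from The Things Stack source code), a confirmed Class C downlink is held back while the previous downlink is still inside its ClassCTimeout window. The function name and the 300 s default are assumptions made for this sketch.

```python
def can_send_confirmed(now_s, last_downlink_s, class_c_timeout_s=300.0):
    """Return True if a confirmed Class C downlink could be scheduled at now_s.

    last_downlink_s is the time of the previous downlink (confirmed or not),
    or None if there has been none. Hypothetical helper for illustration.
    """
    if last_downlink_s is None:
        return True  # nothing outstanding
    return now_s - last_downlink_s >= class_c_timeout_s


no_previous = can_send_confirmed(100.0, None)
inside_window = can_send_confirmed(100.0, 0.0)
after_window = can_send_confirmed(300.0, 0.0)
print(no_previous, inside_window, after_window)
```

With frequent downlinks, each one restarts the window, so the time in which a confirmed downlink can be scheduled shrinks accordingly.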

I do understand that this is counter-intuitive, but I would have to dig into this further if we can and should change this behavior.

Thanks for the reply & information! Would it be useful to open a GitHub issue referencing this, or would you prefer keeping the situation as is?

You can indeed file a GitHub issue so we can at least keep track of it and I can have someone look into it more closely.

Honestly I don’t see why a confirmed downlink cannot immediately follow an unconfirmed downlink. LoRaWAN 1.0.4 disallows MAC commands in Class C downlink, and The Things Stack applies that behavior for all LoRaWAN versions, so it’s not for multiple outstanding downlinks that need to be answered (via MAC command answers or frame acknowledgment).

Thanks everyone for continuing to help with this. Just thinking out loud here…

I’ve been using “replace the downlink queue”. Though in either case the affected devices still will not have a confirmed downlink scheduled.
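
For anyone reproducing this outside the Console, the two queue operations can be sketched as below. The URL path and field names follow my understanding of the TTS webhook downlink API and should be checked against the documentation for your version; the host and all identifiers are placeholders, and the helper only builds the request without sending it (an `Authorization: Bearer <API key>` header would be needed to actually send).

```python
import base64
import json


def build_downlink_request(base_url, app_id, webhook_id, device_id, payload, *,
                           f_port=1, confirmed=False, replace=False):
    """Return (url, body) for a webhook downlink request; nothing is sent."""
    operation = "replace" if replace else "push"
    url = (f"{base_url}/api/v3/as/applications/{app_id}"
           f"/webhooks/{webhook_id}/devices/{device_id}/down/{operation}")
    body = json.dumps({
        "downlinks": [{
            "frm_payload": base64.b64encode(payload).decode("ascii"),
            "f_port": f_port,
            "confirmed": confirmed,
            "priority": "NORMAL",
        }]
    })
    return url, body


url, body = build_downlink_request(
    "https://example.invalid",            # placeholder cluster address
    "my-app", "my-webhook", "my-device",  # placeholder identifiers
    b"\x01\x02", f_port=15, confirmed=True, replace=True,
)
print(url)
print(body)
```

Switching `replace=False` targets the push operation instead, which appends to whatever is already queued.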

So based on the observed behavior, we are assuming these confirmed downlinks are being added to the queue, just not scheduled. This is true even if I am using “replace the downlink queue”. So somewhere along the chain, something is preventing it from being scheduled.

End device duty cycle limits will not affect the Network Server’s ability to schedule a downlink. So the fact that these downlinks are not being scheduled is not a direct result of the end device duty cycle limits (though it could still be an indirect result…).

I have tried disabling “Enforce duty-cycle” in the settings and even manually restarted the gateway, but this did not resolve the issue.

I suppose this would make sense for confirmed downlinks. If the Network Server schedules a confirmed downlink, it will wait for the timeout period to receive confirmation from the end device.

Some additional debugging observations:

  1. I’ve disabled duty cycle enforcement on both the end device and the gateway, and this hasn't resolved the issue.
  2. Even in the cases where I initiate an unconfirmed downlink to an affected device and see that it's scheduled by the Network Server, the device logging shows that it is not receiving the downlink.

Here is where I think the problem lies
The affected end devices (RAK4631) seem to stop being able to receive downlinks. Even in cases where a downlink is scheduled (as confirmed by gateway logging), the end device doesn’t receive it. I believe this explains all the behavior we’ve seen.

  1. I cannot schedule confirmed downlinks:
    The device sends a confirmed uplink every 10 minutes → NS sends an ack downlink but doesn't get a response → NS waits 300s (5 min) for Class C timeout → There's only a 5 min window now where a confirmed downlink could get scheduled, so maybe I can get 1 confirmed downlink scheduled → No confirmed downlink ack as the device cannot receive downlinks, so NS falls into 300s timeout → device sends another confirmed uplink → repeat cycle.
  2. Changing duty cycle has no effect:
    This is, at least not directly, a duty cycle issue, so not enforcing duty cycle will have no impact
  3. Restarting the device I receive messages:
    After end device resets, it seems like it CAN receive downlinks, so it receives whatever the NS has kept in the queue.

Now, as to why the end device cannot receive downlinks… that is not clear to me. All other functionality of the device is there, and it schedules uplinks on the set period.

I think for now I may just force the device to reset every 1-2 days and see if that gives me my functionality. This is obviously not ideal but this will likely require debugging with the hardware manufacturer.

Would you all agree that the behavior we are seeing makes sense if the device is failing to receive the downlinks?

The NS won’t expect a response to a confirmed uplink Ack. If the device doesn’t get an Ack, it will repeat the uplink dependent on implementation and NbTrans.

This may impact the rest of the thinking on the flow from thereafter.

This would point to the LW stack on the device.

Can you provision something completely different to test alongside, either a vanilla LoRaMac-node based device or one using the new stack from Semtech?