#wikimedia-office: TechCom RFC meeting

Meeting started by RoanKattouw at 21:03:40 UTC (full logs).

Meeting summary

  1. Job queue issues (RoanKattouw, 21:03:47)
    1. https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/Job_Queue (RoanKattouw, 21:03:54)
    2. https://etherpad.wikimedia.org/p/JobQueue-ircmeeting (DanielK_WMDE__, 21:04:01)
    3. <mobrovac> the scheduling in eventbus will be per job type, in order of ingress (DanielK_WMDE__, 21:12:55)
    4. <TimStarling> having fairness of scheduling between wikis was a deliberate design decision, IMHO important and useful (DanielK_WMDE__, 21:13:49)
    5. <Krinkle> The end use case that should remain is that if a wiki is dormant and I schedule 1 job there, it should run nearly instantly no matter what. (DanielK_WMDE__, 21:15:10)
    6. <Krinkle> _joe_: we should confirm then if the problem is the "wasting of time" on subjectively unimportant jobs, or the waste on cycles checking/switching wikis. The former might be a hard sell. (DanielK_WMDE__, 21:16:28)
    7. if there are a lot of jobs being queued from a given wiki, it makes sense to defer those jobs for a while so that deduplication can take effect (DanielK_WMDE__, 21:18:57)
    8. <brion> queue size as a raw number is pretty meta... lag is human-scaled :) (DanielK_WMDE__, 21:24:55)
    9. <DanielK_WMDE__> mobrovac: an LRU list of job signatures recently seen locally, kept in memory. that would be quick-and-dirty dedupe before push. (DanielK_WMDE__, 21:25:08)
    10. <MaxSem|grrrr> why instead of deduping not just use page_touched to skip [RefreshLinksJobs] quickly? (DanielK_WMDE__, 21:27:26)
    11. <Krinkle> I believe it is spending most time waiting for replag. A job queue write is not complete until after we wait for all slaves to have replicated the write. (DanielK_WMDE__, 21:30:42)
    12. <Platonides> from a wiki user POV, it should be possible to see the backlog for a wiki, so that you could know "all changes made before <two weeks> have taken effect" or "wait two weeks for all transclusions to update" (legoktm, 21:30:52)
    13. waiting for replication makes sense. DB throughput is a hard limit on job execution, and should be. batching can improve that. but batching kills deduplication (DanielK_WMDE__, 21:32:32)
    14. maybe the runner could keep track of the avg execution time per job type, and consider that value for scheduling fairness. so a large job on one wiki would count for many small jobs on another wiki (DanielK_WMDE__, 21:36:30)
    15. https://github.com/wikimedia/budgeteer (gwicke, 21:36:51)
    16. <TimStarling> for third parties I think we should do like wordpress and have a cron.php which you hit from cron with curl (DanielK_WMDE__, 21:41:28)
    17. Stock MediaWiki job runner (maintenance/runJobs.php) invokes JobRunner class directly, not over HTTP. For cache and config consistency, we should consider standardising on Special:RunJobs over http. (Krinkle, 21:44:15)
    18. MediaWiki by default will run one job per web request <https://www.mediawiki.org/wiki/Manual:$wgJobRunRate> (legoktm, 21:45:23)
    19. [with kafka] there is a combination of concurrency limiting, and cost-based rate limiting; with cost being typically dominated by execution cost (DanielK_WMDE__, 21:47:40)
    20. <Krinkle> It sounds like the new stack performs an HTTP call to MediaWiki/rpc for each individual job, whereas the current model does it per batch (wiki+job type+batch limits) (DanielK_WMDE__, 21:48:19)
    21. <_joe_> So regarding deduplication, I am unsure how effective it is, because there is actually no way to tell right now <_joe_> I don't think it would be hard to compute those data <_joe_> we just don't (DanielK_WMDE__, 21:57:42)
    22. 14:54:00 <TimStarling> my test wiki now unconditionally queues 6 jobs per edit, it used to be between 0 and 1 (RoanKattouw, 21:58:15)
    23. https://grafana.wikimedia.org/dashboard/db/job-queue-rate?panelId=7&fullscreen&orgId=1 (gwicke, 22:00:30)
    24. https://github.com/wikimedia/mediawiki/commit/cb7c910ba72bdf4c2c2f5fa7e7dd307f98e5138e (Krinkle, 22:03:05)
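
The per-wiki fairness requirement stated by TimStarling and Krinkle (a single job queued on a dormant wiki should run nearly instantly, even while another wiki has a large backlog) can be illustrated with a round-robin over per-wiki FIFO queues. This is a minimal sketch under that assumption only; it is not how the current runner or the EventBus/Kafka stack actually schedules jobs, and the class name PerWikiRoundRobin and the job strings are hypothetical.

```python
from collections import OrderedDict, deque


class PerWikiRoundRobin:
    """Hypothetical scheduler: one FIFO queue per wiki, wikis served in turn.

    A wiki with a single pending job waits for at most one job from each other
    active wiki, instead of waiting behind a busy wiki's whole backlog.
    """

    def __init__(self):
        # wiki -> deque of pending jobs; dict order is the round-robin order.
        self._queues = OrderedDict()

    def push(self, wiki, job):
        self._queues.setdefault(wiki, deque()).append(job)

    def pop(self):
        """Return (wiki, job) for the wiki whose turn it is, or None if idle."""
        if not self._queues:
            return None
        wiki = next(iter(self._queues))
        queue = self._queues[wiki]
        job = queue.popleft()
        if queue:
            self._queues.move_to_end(wiki)  # back of the line for this wiki
        else:
            del self._queues[wiki]          # wiki is idle again
        return wiki, job


if __name__ == "__main__":
    s = PerWikiRoundRobin()
    for i in range(3):
        s.push("enwiki", f"refreshLinks-{i}")
    s.push("dormantwiki", "htmlCacheUpdate-0")
    # The dormant wiki's only job runs second, not fourth.
    print([s.pop() for _ in range(4)])
```

By contrast, scheduling purely per job type in order of ingress (as described for EventBus) would run the dormant wiki's job only after the three enwiki jobs queued before it.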
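
DanielK_WMDE__'s "quick-and-dirty dedupe before push" (an in-memory LRU of recently seen job signatures) might look roughly like the sketch below. Assumptions: the signature is a hash of the job type plus all of its parameters, whereas the real job queue has its own notion of which parameters matter for deduplication; the class RecentJobFilter and its capacity are hypothetical. Because the list is kept locally, it only catches duplicates pushed by the same process.

```python
import hashlib
import json
from collections import OrderedDict


class RecentJobFilter:
    """In-memory LRU of recently pushed job signatures (hypothetical helper)."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._seen = OrderedDict()  # signature -> None, least recently seen first

    @staticmethod
    def signature(job_type, params):
        # Canonical JSON so that dict key order does not change the signature.
        blob = json.dumps([job_type, params], sort_keys=True, separators=(",", ":"))
        return hashlib.sha1(blob.encode("utf-8")).hexdigest()

    def should_push(self, job_type, params):
        """Return False if an identical job was seen recently, else record it."""
        sig = self.signature(job_type, params)
        if sig in self._seen:
            self._seen.move_to_end(sig)  # refresh recency, skip the push
            return False
        self._seen[sig] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the least recently seen
        return True
```

A caller would check should_push() before enqueueing and simply drop the job when it returns False. Deferring bursts of jobs from a busy wiki, as also suggested in the meeting, would give this kind of filter (and queue-side deduplication) more duplicates to catch.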
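
The idea of tracking average execution time per job type and using it for scheduling fairness ("a large job on one wiki would count for many small jobs on another wiki") resembles cost-based fair queueing. The sketch below assumes an exponentially weighted moving average (EWMA) of measured run times and always serves the pending wiki that has been charged the least estimated time so far; it is one possible reading of the idea, not the budgeteer or Kafka-based implementation, and all names and default values are hypothetical.

```python
from collections import defaultdict, deque


class CostFairScheduler:
    """Hypothetical sketch: per-wiki fairness measured in estimated run time.

    Each wiki is charged the running average execution time of its job type,
    and the next job comes from the pending wiki charged least so far, so one
    slow job on one wiki "counts" as many fast jobs on another.
    """

    def __init__(self, alpha=0.2, default_cost=1.0):
        self.alpha = alpha                # EWMA smoothing factor (assumed value)
        self.default_cost = default_cost  # estimate for job types never timed yet
        self.avg_cost = {}                # job type -> smoothed seconds per job
        self.spent = defaultdict(float)   # wiki -> estimated seconds charged
        self.queues = defaultdict(deque)  # wiki -> pending (job_type, payload)

    def push(self, wiki, job_type, payload=None):
        if not self.queues[wiki]:
            # A wiki that becomes active starts level with the least-charged
            # active wiki, so idle time does not bank unlimited credit.
            active = [self.spent[w] for w, q in self.queues.items() if q]
            if active:
                self.spent[wiki] = max(self.spent[wiki], min(active))
        self.queues[wiki].append((job_type, payload))

    def estimate(self, job_type):
        return self.avg_cost.get(job_type, self.default_cost)

    def pop(self):
        """Return (wiki, job_type, payload) from the least-charged wiki, or None."""
        pending = [w for w, q in self.queues.items() if q]
        if not pending:
            return None
        wiki = min(pending, key=lambda w: self.spent[w])
        job_type, payload = self.queues[wiki].popleft()
        self.spent[wiki] += self.estimate(job_type)
        return wiki, job_type, payload

    def record(self, job_type, seconds):
        """Fold a measured execution time into the per-type running average."""
        old = self.avg_cost.get(job_type)
        if old is None:
            self.avg_cost[job_type] = seconds
        else:
            self.avg_cost[job_type] = self.alpha * seconds + (1 - self.alpha) * old
```

With refreshLinks averaging 10 s and htmlCacheUpdate 1 s, a wiki with two refreshLinks jobs and a wiki with five htmlCacheUpdate jobs are served in the order big, small ×5, big: the first slow job earns the other wiki five fast ones. A newly active wiki still gets an early turn, which keeps the "dormant wiki runs nearly instantly" property.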
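
brion's point that lag is more human-scaled than raw queue size, and Platonides's request to see a wiki's backlog as "all changes made before <time> have taken effect", both suggest reporting the age of the oldest pending job per wiki. Any job enqueued before that oldest pending job has already left the queue. The sketch assumes each queued job carries an enqueue timestamp; QueuedJob and backlog_lag are hypothetical names.

```python
import time
from dataclasses import dataclass


@dataclass
class QueuedJob:
    wiki: str
    job_type: str
    enqueued_at: float  # Unix timestamp recorded at push time (assumed field)


def backlog_lag(pending, now=None):
    """Return wiki -> seconds since its oldest still-pending job was enqueued."""
    now = time.time() if now is None else now
    oldest = {}
    for job in pending:
        t = oldest.get(job.wiki)
        if t is None or job.enqueued_at < t:
            oldest[job.wiki] = job.enqueued_at
    return {wiki: now - t for wiki, t in oldest.items()}
```

For example, with enwiki jobs enqueued at t=100 and t=150 and a dewiki job at t=190, calling backlog_lag at t=200 reports 100 s of lag for enwiki and 10 s for dewiki, regardless of how many jobs each queue holds.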


Meeting ended at 22:05:14 UTC (full logs).

Action items

  1. (none)


People present (lines said)

  1. Krinkle (87)
  2. DanielK_WMDE__ (84)
  3. _joe_ (78)
  4. gwicke (60)
  5. TimStarling (36)
  6. brion (22)
  7. RoanKattouw (19)
  8. Platonides (13)
  9. mobrovac (13)
  10. Pchelolo (6)
  11. legoktm (5)
  12. wm-labs-meetbot (3)
  13. MaxSem|grrrr (3)
  14. bd808 (3)
  15. Scott_WUaS (2)
  16. Amir1 (2)
  17. SMalyshev (2)
  18. stashbot (1)
  19. no_justification (1)
  20. volans (1)
  21. kaldari (1)


Generated by MeetBot 0.1.4.