• Saik0@lemmy.saik0.com
    1 day ago

    They can also crawl this publicly accessible social media source for their data sets.

    Crawling would be silly. They can simply set up a Lemmy node and subscribe to every other server. An ActivityPub-based crawler would be much more efficient, since it wouldn’t accidentally re-crawl things that haven’t changed and could instead just read the ActivityPub updates.

    • Strawberry@lemmy.blahaj.zone
      23 hours ago

      Sure, but we’re in the comments section of an article about Wikipedia being crawled, which is silly because they could just download a snapshot of Wikipedia.