A new campaign dubbed ShadowRay 2.0 is weaponizing exposed Ray clusters to build a self‑propagating cryptomining botnet and more. Researchers at Oligo attribute the activity to an actor they track as IronErn440, who exploits a long‑known, unauthenticated Jobs API flaw (CVE‑2023‑48022) to submit and run attacker jobs across internet‑reachable Ray head nodes. The payloads are notable for using AI‑generated code and for combining stealthy Monero mining with interactive remote shells, credential theft, and DDoS capabilities — all orchestrated to spread autonomously between clusters.
What happened — the attack in plain terms
- The attacker finds Ray clusters with the Jobs API exposed to the internet and submits jobs that execute Bash/Python payloads.
- Payloads install an AI‑crafted miner (XMRig), establish persistence (cron/systemd), and hide under fake process names and deceptive file paths.
- The malware limits CPU/GPU usage to avoid immediate detection, blocks rival miners, and schedules periodic checks against a GitHub repo to pull updates.
- Beyond mining, the actor opens Python reverse shells, exfiltrates data (credentials, models, source code), and can launch DDoS attacks using tools like Sockstress.
- The campaign uses public code hosting (GitLab/GitHub) for payload delivery; Oligo observed two waves — one via GitLab (ended Nov 5) and a persistent GitHub wave (since Nov 17).
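The abused interface is simple to picture: Ray's Job Submission REST API accepts a POST whose `entrypoint` field is an arbitrary shell command, and on an unpatched, unauthenticated head node that command simply runs. A minimal sketch of the request shape (field names follow recent Ray documentation and may vary across releases; no network call is made here):

```python
import json

def build_job_request(entrypoint: str) -> dict:
    """Shape of a Ray Jobs API submission body (POST /api/jobs/).

    On an internet-exposed, unauthenticated head node, `entrypoint`
    executes as a shell command -- the behavior CVE-2023-48022 abuses.
    """
    return {
        "entrypoint": entrypoint,
        # Optional fields such as runtime_env and metadata may also be
        # supplied; only entrypoint is required.
    }

# The JSON body a submitted "job" would carry:
body = json.dumps(build_job_request("echo hello"))
```

Because the API was designed for trusted networks, nothing in this exchange identifies or authorizes the submitter — which is why network-level restriction (below) is the first line of defense.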
Why this is worrying
- Scale of exposure: Oligo reports hundreds of thousands of Ray servers reachable on the public internet — a vast increase over the exposure documented when the original ShadowRay campaign surfaced.
- Trusted‑by‑design blind spot: Ray was intended for trusted networks; unauthenticated APIs were not expected to face internet threats, leaving many clusters vulnerable.
- AI‑assisted malware: payloads appear to be LLM‑generated — comments, docstrings, and structure point to automated code generation, speeding development and evasion.
- Multifaceted abuse: the campaign is not limited to mining — it also steals credentials, threatens intellectual property (models, code), and provides a platform for secondary abuse (DDoS, lateral movement).
Immediate actions for Ray cluster operators (0–24 hours)
- Identify exposures
- Scan for Ray head nodes and Jobs API endpoints reachable from the public internet (default dashboard port 8265 and Jobs API endpoints).
- Block public access now
- Restrict access to the Ray dashboard and Jobs API behind firewalls, security groups, or VPNs; deny 0.0.0.0/0 listeners.
- Apply network controls
- Enforce least‑privilege network rules: allow only known admin IPs and private subnets to interact with management endpoints.
- Hunt for compromise indicators
- Look for scheduled cron jobs, new systemd units, unknown binaries, fake process names (e.g., dns‑filter), unusual outbound connections (to GitHub/GitLab or C2), and XMRig artifacts.
- Contain suspected infected clusters
- Isolate compromised nodes from the network, preserve disk images and logs, and, where possible, avoid rebooting until forensic captures are complete.
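The first step above — finding exposed head nodes — can be triaged with a quick reachability check from an untrusted vantage point (for example, a host outside your VPC). A minimal sketch; a successful TCP connect means the management plane is reachable from wherever the check runs, not necessarily that the API is unauthenticated:

```python
import socket

RAY_DASHBOARD_PORT = 8265  # default Ray dashboard / Jobs API port

def dashboard_reachable(host: str, port: int = RAY_DASHBOARD_PORT,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the given port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Any True result from outside your network perimeter means the Ray
# management endpoints are publicly reachable and should be firewalled.
```

Run it against your cluster inventory from both inside and outside the perimeter; only the inside check should succeed.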
Detection and hunting playbook
- File and process signs
- Search for XMRig binaries, renamed curl executables, or downloads disguised as PDFs; check for pythonw.exe launching unknown init.py files.
- Persistence artifacts
- Enumerate cron entries, new systemd service files, and suspicious /etc/hosts or iptables rules that block mining pools.
- Network and telemetry
- Monitor for outbound connections to public code hosting, unknown SSH or reverse‑shell connections, and frequent small‑payload fetches from GitHub/GitLab.
- Resource usage patterns
- Flag long‑running high‑CPU/GPU processes using nonstandard names or repeatedly launching at cron intervals (checks every 15 minutes have been reported).
- Integrity checks
- Compare running binaries and configs against golden images; scan for modified startup scripts or unexpected Python packages installed system‑wide.
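The persistence and process hunts above lend themselves to simple pattern matching over `crontab -l`, `ps` output, and systemd unit files. A sketch using indicator strings drawn from this campaign ("dns‑filter" and XMRig are cited above; extend the list with your own threat intelligence):

```python
import re

# Indicators reported for ShadowRay 2.0; illustrative, not exhaustive.
IOC_PATTERNS = [
    re.compile(r"xmrig", re.IGNORECASE),            # miner binary name
    re.compile(r"dns-filter"),                      # reported fake process name
    re.compile(r"(github|gitlab)\.com/\S+\s*\|\s*(ba)?sh"),  # curl | sh from code hosts
]

def find_indicators(text: str) -> list[str]:
    """Return the lines of `text` (e.g. crontab output, a ps listing,
    or a systemd unit file) matching any known indicator pattern."""
    return [line for line in text.splitlines()
            if any(p.search(line) for p in IOC_PATTERNS)]

# Typical inputs: per-user `crontab -l`, /etc/cron.*/*,
# `ps axo comm,args`, and units under /etc/systemd/system/.
```

String matching is a first pass only; pair it with the integrity checks against golden images described above, since the malware renames binaries and forges process names.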
Remediation and recovery steps
- Snapshot and preserve evidence before changes.
- Remove persistence: delete malicious cron jobs and systemd units, and remove dropped binaries after preserving copies for analysis.
- Rotate credentials: change any exposed keys, database credentials, or service account tokens that could be harvested from the cluster.
- Rebuild compromised nodes: where root compromise is likely, prefer rebuild from trusted images and rekey systems rather than partial cleanup.
- Harden deployments: reconfigure Ray instances to require authentication/authorization on the Jobs API and dashboard, and place clusters inside private networks or zero‑trust boundaries.
- Patch or mitigate: apply vendor/Anyscale recommendations and follow best practices for secure Ray deployments.
Longer‑term mitigations and best practices
- Never expose management APIs to the public internet: put all Ray head nodes behind private VPCs, bastion hosts, or VPNs.
- Add authentication and authorization layers: require tokens, mTLS, or API keys for job submissions and administrative actions.
- Enforce network segmentation: keep AI workloads isolated from general compute and from backups or credential stores.
- Implement continuous monitoring: detect anomalous job submissions, unusual cron jobs, and outbound connections to code hosts or unfamiliar IPs.
- Use immutable infrastructure: deploy ephemeral worker nodes that are reprovisioned from trusted images to limit persistence windows.
- Protect secrets and models: store keys and model artifacts in vaults with strict access control, not on cluster nodes.
- Threat‑model AI infra: include LLM‑assisted malware in red‑team scenarios and update incident playbooks accordingly.
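The authentication layer recommended above can be as simple as a reverse proxy in front of the Jobs API that rejects requests lacking a valid bearer token. A sketch of the check itself (the token value and header handling are illustrative; in production the secret would come from a vault, and mTLS or OIDC would be stronger choices):

```python
import hmac

# Illustrative only -- fetch the real secret from a vault at startup.
EXPECTED_TOKEN = "replace-with-secret-from-your-vault"

def is_authorized(headers: dict) -> bool:
    """Constant-time check of an `Authorization: Bearer <token>` header."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    presented = auth[len("Bearer "):]
    # hmac.compare_digest avoids leaking the token via timing differences.
    return hmac.compare_digest(presented, EXPECTED_TOKEN)
```

Even this minimal gate would have blocked the campaign's anonymous job submissions; the point is that *some* authorization decision must sit between the internet and the Jobs API.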
Advice for platform and cloud providers
- Block abusive patterns: flag and throttle automated job submissions to unauthenticated Ray APIs detected in public clouds.
- Provide secure defaults: ship Ray and Ray‑based solutions with dashboard and Jobs API access disabled by default on public interfaces.
- Offer managed authentication: integrate native identity and access flows (OIDC, IAM) for job submission and admin actions.
- Rate‑limit and sign job submissions: require signed job manifests or allowlisting of job submitters for shared or public clusters.
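The signed‑manifest idea above can be sketched with an HMAC over a canonical encoding of the job manifest: the submitter signs with a shared key, and the cluster gateway verifies before accepting the job. This is an illustrative design, not an existing Ray feature; `sign_manifest` and `verify_manifest` are hypothetical names:

```python
import hashlib
import hmac
import json

def sign_manifest(manifest: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the manifest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

def verify_manifest(manifest: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison of the presented signature."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)

# A gateway would reject any job whose manifest fails verification or
# whose submitter is not on the cluster's allowlist.
```

Sorting keys and fixing separators makes the encoding canonical, so semantically identical manifests always produce the same signature.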
Final thought
ShadowRay 2.0 illustrates how modern attackers marry old architectural assumptions with new tooling — LLMs to generate payloads and public development platforms to deliver them — to harvest compute resources at scale. The fix is straightforward in principle: treat cluster management endpoints as high‑risk assets, deny public exposure, and require explicit authentication and network separation. In practice, the remediation demands coordinated action across DevOps, platform engineering, and security teams — and it should be treated as urgent for anyone running Ray in any environment that touches the public internet.