Web computing is a variant of parallel computing where the idle times of PCs donated by worldwide distributed users are employed to execute parallel programs. In this thesis we consider a web computing variant with two important properties: First, we support the execution of coupled, massively parallel algorithms (rather than distributed data processing). And second, we organize the system in peer-to-peer fashion. We present the Paderborn University BSP-based Web Computing (PUB-Web) library, which supports the execution of parallel programs in the bulk-synchronous style (BSP) in such a web computing setting. In this thesis, we focus on important technical and algorithmic aspects, in particular: In order to schedule processes with respect to the currently available computing power, which continually changes in an unpredictable fashion, we need intelligent load balancing algorithms and -- as a basic precondition -- the technical ability to migrate threads at runtime. To achieve the latter in a way suitable for production use, compatible with recent Java versions, available for all important platforms, and easy-to-use for developers, we develop the PadMig thread migration and checkpointing library. In order to tackle the distributed load balancing problem, we present an algorithm based on Distributed Heterogeneous Hash-Tables. In order to judge the quality of the schedules produced, we perform extensive experiments to compare several variants of the DHHT-based load balancer with the well- established Work Stealing algorithm, using realistic input data obtained by profiling the utilization of several hundred PCs for a period of several months. Beside the available computing power, we finally also consider the network bandwidth as a secondary criterion for load balancing. For this purpose, we cluster the PUB-Web network according to bandwidth, employing a novel, fault-tolerant, adaptive, and scaling distributed clustering algorithm called DiDiC. In order to judge the quality of the clusterings produces by DiDiC, we experimentally compare it to the well-established MCL algorithm using a simulator.