1 Introduction

Load flow study is a vital tool for power system planning and operation. However, power systems are subject to many uncertainties, which arise from changes in load demand, generator outages and changes in the network. Large-scale wind power integration introduces further uncertainty. Many researchers have therefore applied probabilistic load flow (PLF) methods to handle these uncertainties.

A critical review was provided in [1], where the methods to solve PLF problems were classified into three types, namely simulation methods, approximation methods and analytical methods.

Monte Carlo simulation (MCS) can obtain accurate results after a large number of simulations, and its results are generally treated as the reference for comparisons. However, MCS is time-consuming. Importance sampling [2], Latin hypercube sampling (LHS) [3, 4] and Latin supercube sampling [5] have been applied to reduce the computational burden of MCS. To handle correlated input random variables, the Nataf transformation [6, 7] and copula functions [8] were applied together with LHS. References [9, 10, 11] applied a quasi-Monte Carlo approach, which was more efficient than MCS, to solve PLF problems.

The point estimate method (PEM), one of the approximation methods, has been widely applied to solve PLF problems. Reference [12] first proposed the 2m PEM to solve PLF. Reference [13] introduced a modified 2m PEM to handle correlated uncertain variables. Compared with the 2m PEM, the 2m+1 PEM had higher accuracy but required more simulations [14]. To handle correlated uncertain variables, references [15, 16] provided a modified 2m+1 PEM based on Cholesky decomposition. Reference [17] applied the 2m+1 PEM to solve probabilistic three-phase load flow for unbalanced electrical systems with wind farms. Reference [18] discussed the performance of the five-point estimate method (5PEM). Reference [19] proposed another approximation method, the unscented transformation method (UTM), which could consider correlations of input variables. Approximation methods are generally more efficient than MCS. However, their accuracy and efficiency are sensitive to the number of input random variables.

Analytical methods do not require the large number of simulations that MCS does. In [20], a first-order second-moment method (FOSMM) was applied to obtain the mean and standard deviation of load flow solutions. The sequence operation methodology [21] is another analytical method. It is highly efficient, but the sequence operation must satisfy new operation rules, which limits its application. The cumulant method (CM) is another analytical method to solve PLF and performs very well in terms of computational efficiency. In [22, 23], CM and the Gram-Charlier expansion were applied to solve PLF. Reference [24] discussed the properties, advantages and deficiencies of three types of series expansions, namely the Gram-Charlier, Edgeworth and Cornish-Fisher expansions. Furthermore, Cholesky decomposition [25] and joint cumulants [24, 26] were utilized to deal with correlations of input variables. Reference [27] applied a Gaussian mixture approximation method to handle non-normal correlated random variables. References [28, 29] applied the maximum entropy method instead of series expansions to calculate probability density functions (PDFs). This could improve the accuracy of the PDFs, but required a more complex process.

CM requires less computational effort than other methods. However, it may produce significant errors when input random variables have large fluctuations. The reason is that the linear relationship between input and output variables is obtained by linearizing the load flow equations at the operating point. When input random variables move far from the operating point, the relationship between input and output variables may change significantly. Reference [30] studied the error resulting from the linear model and showed that the error increases as the ranges of the input variables increase. The wind power output can change significantly over time due to fluctuations of wind speed. To solve PLF considering large-scale wind power integration, reference [31] divided the PDFs of wind power into multiple intervals and incorporated these intervals into the integral formulation for calculating cumulants. However, this method cannot handle the correlation between different input variables and is computationally complex. Our previous work [32] solved probabilistic optimal power flow (P-OPF) with large fluctuations by combining the traditional K-means clustering technique with CM. However, the traditional K-means performs inefficiently for large-scale systems.

To solve PLF considering large-scale wind power integration, this paper proposes a novel PLF method by combining an improved K-means clustering technique and CM. It tackles the problems that the traditional CM cannot handle input random variables with large fluctuations, and that the traditional K-means is not efficient for large-scale systems. Compared with existing methods, such as the traditional CM, 2m+1 PEM, LHS and MCS, the proposed method can achieve a better performance with consideration of both computational efficiency and accuracy. The proposed method can be used to analyze effects of uncertainties on power systems.

The rest of this paper is organized as follows: Section 2 introduces the CM for PLF formulation. The theoretical framework of the proposed method is described in Section 3. IEEE 9-bus and 118-bus test systems are modified for case studies in Section 4. Finally, conclusions are summarized in Section 5.

2 CM for PLF formulation

CM for PLF is based on the linearization of load flow equations. In the process, a set of equations, which consist of nodal power injections and line power flows, are formulated.

Let X be the vector of nodal power injections, U be the vector consisting of voltage angles at PV and PQ buses and voltage magnitudes at PQ buses, and Z be the vector of power flows in branches. AC power flow equations can be written as:

$$ \left\{ {\begin{array}{l} {\varvec{X} = \varvec{f}\left( \varvec{U} \right)} \\ {\varvec{Z} = \varvec{g}\left( \varvec{U} \right)} \\ \end{array} } \right. $$
(1)

where f(·) and g(·) are the corresponding power injection functions and corresponding line flow functions, respectively.

The linear equations can be obtained by linearizing (1) at the operating point:

$$ \left\{ {\begin{array}{l} {\varvec{U} = \varvec{U}_{0} + \varvec{J}_{0}^{ - 1} \Delta \varvec{X} = \varvec{U}_{0} + \varvec{H}_{0} \Delta \varvec{X}} \\ {\varvec{Z} = \varvec{Z}_{0} + \varvec{G}_{0} \Delta \varvec{U} = \varvec{Z}_{0} + \varvec{L}_{0} \Delta \varvec{X }} \\ \end{array} } \right. $$
(2)

where \( \varvec{U}_{0} \) and \( \varvec{Z}_{0} \) are the values of U and Z at the operating point, respectively; ∆U and ∆X are the vectors of the changes in U and X; \( \varvec{J}_{0}^{ - 1} \) is the inverse of the Jacobian matrix at the operating point; \( \varvec{H}_{0} = \varvec{J}_{0}^{ - 1} \); \( \varvec{G}_{0} = \left. \left( \partial \varvec{Z} / \partial \varvec{U} \right) \right|_{\varvec{U} = \varvec{U}_{0}} \); \( \varvec{L}_{0} = \varvec{G}_{0} \varvec{J}_{0}^{ - 1} \).

For power systems containing wind farms, fluctuations of both wind generation and load can result in uncertainties. The active wind power output can be obtained as follows [23]:

$$ P_{\text{w}} = \left\{ {\begin{array}{ll} 0 & {0 < \nu_{i} < \nu_{\text{ci}} } \\ {\dfrac{\nu_{i} - \nu_{\text{ci}} }{\nu_{\text{r}} - \nu_{\text{ci}} }P_{\text{r}} } & {\nu_{\text{ci}} \le \nu_{i} < \nu_{\text{r}} } \\ {P_{\text{r}} } & {\nu_{\text{r}} \le \nu_{i} < \nu_{\text{co}} } \\ 0 & {\nu_{i} \ge \nu_{\text{co}} } \\ \end{array} } \right. $$
(3)

where Pw is the active wind power output; Pr is the rated power of the wind farm; vi is the wind speed of the wind farm; vci, vr and vco are the cut-in, rated and cut-out speeds of the wind farm, respectively. Wind power output is treated as a negative load, whose power factor is kept constant [13].
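The piecewise curve in (3) can be sketched as a small function. This is only an illustration: the function name and the 100 MW rated power in the usage note are hypothetical, and the default speeds are the values quoted later for the case studies (3, 13 and 25 m/s).

```python
def wind_power(v, p_rated, v_ci=3.0, v_r=13.0, v_co=25.0):
    """Active wind power output for wind speed v (m/s), following (3).

    p_rated is the rated power of the wind farm; v_ci, v_r and v_co are
    the cut-in, rated and cut-out speeds. Default speeds are assumptions
    taken from the case-study settings.
    """
    if v < v_ci or v >= v_co:
        # Below cut-in or at/above cut-out: no output.
        return 0.0
    if v < v_r:
        # Linear ramp between cut-in and rated speed.
        return (v - v_ci) / (v_r - v_ci) * p_rated
    # Between rated and cut-out speed: rated output.
    return p_rated
```

For example, with these speeds a hypothetical 100 MW farm at 8 m/s sits halfway up the ramp, so `wind_power(8.0, 100.0)` gives 50 MW.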

Thus, \( \Delta \varvec{X} \) can be rewritten as follows:

$$ \Delta \varvec{X} = \Delta \varvec{W} - \Delta \varvec{L} $$
(4)

where W is the vector consisting of active and reactive wind power outputs at corresponding buses; L is the vector consisting of active and reactive load demands at corresponding buses; \( \Delta \varvec{W} \) and \( \Delta \varvec{L} \) are the vectors of the changes in W and L.

Taking a specific variable in U as an example, (2) can be written as the following linear combination:

$$ \begin{aligned} u_{i} =\, & u_{{i0}} + \sum\limits_{{j = 1}}^{{N_{X} }} {h_{{0ij}} \left( {w_{j} - w_{{j0}} } \right)} - \sum\limits_{{j = 1}}^{{N_{X} }} {h_{{0ij}} \left( {l_{j} - l_{{j0}} } \right)} \\ =\, & u_{{si0}} + \sum\limits_{{j = 1}}^{{N_{X} }} {h_{{0ij}} w_{j} } - \sum\limits_{{j = 1}}^{{N_{X} }} {h_{{0ij}} l_{j} } \\ \end{aligned} $$
(5)

where \( u_{si0} = u_{i0} - \sum\limits_{j = 1}^{{N_{X} }} {h_{0ij} w_{j0} } + \sum\limits_{j = 1}^{{N_{X} }} {h_{0ij} l_{j0} } \); \( u_{i} \) is a specific variable in U; wj is the jth variable in W; lj is the jth variable in L; ui0, wj0 and lj0 are the values of ui, wj and lj at the operating point, respectively; h0ij is the value at row i and column j of \( \varvec{H}_{0} \); NX is the number of variables of X. The active and reactive power flows in branches can likewise be expressed as linear combinations of input variables.

According to (5), system variables can be expressed as linear combinations of input random variables. Assuming that the input random variables are independent, the cumulants of output random variables can be calculated by combining the cumulants of input random variables based on the properties of cumulants [22]. In practice, however, input random variables are generally correlated. In this paper, these correlations are handled by the Cholesky decomposition algorithm [25].
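The cumulant property used here is that, for independent inputs, the rth cumulant of a weighted sum is the sum of the rth cumulants of the inputs, each multiplied by the rth power of its weight; a constant offset affects only the first cumulant. The sketch below illustrates this for one output variable of the form (5). The function name and the numerical values are hypothetical; in (5) the weights would be h0ij for wind power and −h0ij for load, and the offset would be usi0.

```python
def combine_cumulants(coeffs, input_cumulants, offset=0.0):
    """Cumulants of y = offset + sum_j coeffs[j] * x_j for independent x_j.

    input_cumulants[j][r - 1] holds the r-th cumulant of input x_j.
    Returns the list of cumulants of y up to the same order.
    """
    order = len(input_cumulants[0])
    out = []
    for r in range(1, order + 1):
        # r-th cumulant of a weighted sum of independent variables.
        k_r = sum(a ** r * k[r - 1] for a, k in zip(coeffs, input_cumulants))
        out.append(k_r)
    # A constant shift changes only the first cumulant (the mean).
    out[0] += offset
    return out


# Hypothetical example: two independent normal inputs N(1, 4) and N(2, 9).
y_cumulants = combine_cumulants(
    coeffs=[0.5, -1.0],
    input_cumulants=[[1.0, 4.0, 0.0, 0.0], [2.0, 9.0, 0.0, 0.0]],
    offset=3.0,
)
```

Here the mean is 0.5·1 − 1·2 + 3 = 1.5 and the variance is 0.25·4 + 1·9 = 10, with zero third and fourth cumulants because both inputs are normal.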

3 Proposed method

The fundamental reason why the traditional CM has high errors for solving PLF of power systems containing large-scale wind power is that the wind power output can change significantly over time due to fluctuations of wind speeds. Therefore, this paper focuses on how to reduce the fluctuations of input random variables. Given the probability distribution functions and correlation coefficient matrix of input random variables, the samples of input random variables can be generated through the inverse Nataf transformation [33]. The values of input variables at the same position form one point as shown in Fig. 1, where Xi is a column vector of samples for a specific input variable (wind power or load). Then, these points are grouped into several clusters through the K-means algorithm. After clustering, the samples in each cluster have small variances. Furthermore, the proposed method adopts the law of total probability to combine the results obtained using CM for PLF in all clusters.

Fig. 1 Samples of input variables
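As a rough illustration of how correlated samples of a wind speed and a load could be generated, the sketch below draws correlated standard normals through a Cholesky factor and maps them to the target marginals (Weibull for wind speed, Gaussian for load). It is a simplified stand-in, not the inverse Nataf transformation of [33]: it assumes the correlation in the normal space equals the target correlation and skips the correlation-adjustment step of the Nataf transformation. The function name and all parameter values are hypothetical.

```python
import math

import numpy as np


def sample_wind_speed_and_load(n, rho, shape, scale, load_mean, load_std, seed=0):
    """Draw n correlated (wind speed, load) samples.

    Assumption: rho is used directly as the correlation of the underlying
    standard normals (no Nataf correlation adjustment).
    """
    rng = np.random.default_rng(seed)
    corr = np.array([[1.0, rho], [rho, 1.0]])
    chol = np.linalg.cholesky(corr)  # lower-triangular factor
    z = rng.standard_normal((n, 2)) @ chol.T  # correlated standard normals

    # Standard normal CDF applied to the first column.
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z[:, 0] / math.sqrt(2.0)))
    u = np.clip(u, 0.0, 1.0 - 1e-12)  # avoid log(0) in the inverse CDF

    # Weibull inverse CDF: x = scale * (-ln(1 - u))^(1 / shape).
    wind_speed = scale * (-np.log1p(-u)) ** (1.0 / shape)
    # Gaussian load: a linear map of the second standard normal.
    load = load_mean + load_std * z[:, 1]
    return wind_speed, load
```

Each row of the resulting pair of sample vectors corresponds to one point in the sense of Fig. 1.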

3.1 Improved K-means algorithm

After all samples are generated, K-means clustering is applied to divide them into several clusters. Each cluster has a cluster center. In effect, the cluster centers form a multi-state model of the random variables. For the case with only one input random variable, such as the load at one bus, the obtained cluster centers correspond to multiple load levels, and analysis of the cluster centers can represent analysis of all load samples. For the case with two input random variables, such as two wind farm outputs in the same area, K-means clustering divides their samples into a number of clusters. Each cluster center is a combination of two wind power output levels, which implicitly reflects the correlation of these two wind power outputs. For samples of more input variables, the K-means algorithm is conducted in the multi-dimensional Euclidean space. The details are introduced in the following subsections.

3.1.1 General steps of K-means algorithm

Step 1: Select initial cluster centers, which are expressed as the matrix M0.

$$ \varvec{M}_{0} = \left[ {\begin{array}{*{20}c} {x_{11}^{0} } & {x_{12}^{0} } & \cdots & {x_{1i}^{0} } & \cdots & {x_{1n}^{0} } \\ \vdots & \vdots & {} & \vdots & {} & \vdots \\ {x_{j1}^{0} } & {x_{j2}^{0} } & \cdots & {x_{ji}^{0} } & \cdots & {x_{jn}^{0} } \\ \vdots & \vdots & {} & \vdots & {} & \vdots \\ {x_{K1}^{0} } & {x_{K2}^{0} } & \cdots & {x_{Ki}^{0} } & \cdots & {x_{Kn}^{0} } \\ \end{array} } \right] $$
(6)

where K is the number of clusters set in advance; \( x_{ji}^{0} \) is the initial center of the variable i in the jth cluster. The jth cluster center can be expressed as: \( \left( {x_{j1}^{0} ,x_{j2}^{0} , \cdots ,x_{ji}^{0} , \cdots ,x_{jn}^{0} } \right) \).

Step 2: Calculate the Euclidean distance of all points to each cluster center.

$$ E_{d} \left( {l,j} \right) = \sqrt {\sum\limits_{i = 1}^{n} {\left( {x_{li} - x_{ji}^{0} } \right)^{2} } } $$
(7)

where \( E_{d} \left( {l,j} \right) \) denotes the Euclidean distance of point l to the center of the jth cluster.

Step 3: Assign all points to the closest cluster according to the Euclidean distance, and recalculate the cluster centers.

Step 4: Repeat Steps 2 and 3 until the cluster centers no longer move.
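Steps 1 to 4 can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in the paper: the function name, the iteration cap and the rule of keeping the old center when a cluster becomes empty are assumptions.

```python
import numpy as np


def kmeans(points, centers, max_iter=100):
    """Plain K-means. points: (N, n) samples; centers: (K, n) initial centers (Step 1).

    Returns the final centers and the cluster index of every point.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Step 2: Euclidean distance of every point to every center, as in (7).
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: assign each point to its closest center ...
        labels = dist.argmin(axis=1)
        # ... and recalculate the centers as the means of their members.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For four points `[[0, 0], [0, 1], [10, 10], [10, 11]]` with initial centers `[[0, 0], [10, 10]]`, the loop settles on the centers (0, 0.5) and (10, 10.5).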

3.1.2 Methods for improving performance of K-means

1) Selection of the initial cluster centers

The clustering performance is sensitive to the initial cluster centers, so it is important to select them carefully. In this paper, 10% of the samples are first randomly selected and clustered. The cluster centers obtained from this first clustering reflect, to some extent, the locations of the cluster centers for all samples. They are then used as the initial cluster centers to perform K-means clustering on all samples.
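The two-stage initialization described above can be sketched as follows. The helper `_kmeans` is a compact plain K-means loop included only to keep the example self-contained; the function names, the random seed, and the choice of K random subset points as starting centers for the first stage are assumptions, since the text does not specify how the first clustering is initialized.

```python
import numpy as np


def _kmeans(points, centers, max_iter=100):
    """Compact plain K-means loop (assign, recompute, stop when centers settle)."""
    for _ in range(max_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels


def two_stage_kmeans(points, k, fraction=0.1, seed=0):
    """Cluster a random subset first, then reuse its centers for all samples."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Stage 1: pre-cluster a random 10% subset (at least k points).
    n_sub = max(k, int(round(fraction * len(points))))
    subset = points[rng.choice(len(points), size=n_sub, replace=False)]
    init = subset[rng.choice(n_sub, size=k, replace=False)]  # assumed starting centers
    pre_centers, _ = _kmeans(subset, init)
    # Stage 2: use the pre-clustered centers as initial centers for all samples.
    return _kmeans(points, pre_centers)
```

Because the second stage starts near its final centers, it typically needs few iterations on the full sample set, which is where most of the cost lies.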

2) Determination of the appropriate value of K

It is necessary to determine the number of clusters before performing K-means clustering. Therefore, the weighted average radius (WAR) is proposed to evaluate the clustering performance. The WAR can be calculated as follows:

$$ R = \sum\limits_{j = 1}^{K} {p_{j} r_{j} } $$
(8)

where R is the WAR; pj is the ratio of the number of points in the jth cluster to the number of all points; rj is the radius of the jth cluster [34].

In general, the value of WAR decreases as the number of clusters increases. Furthermore, the WAR decreases only slowly once the number of clusters exceeds a certain value, which indicates that the quality of clustering does not improve significantly beyond that value. Therefore, that value is suggested as the appropriate value of K.
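A minimal sketch of (8) is given below. The definition of the cluster radius rj is taken from [34] in the text and is not reproduced there, so the sketch assumes one common definition: the root-mean-square Euclidean distance of the members of a cluster to its center. The function name is hypothetical.

```python
import numpy as np


def weighted_average_radius(points, centers, labels):
    """WAR of a clustering, R = sum_j p_j * r_j, as in (8).

    Assumption: r_j is the root-mean-square distance of cluster members
    to the cluster center.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    war = 0.0
    for j in range(len(centers)):
        members = points[labels == j]
        if len(members) == 0:
            continue  # an empty cluster has p_j = 0 and contributes nothing
        p_j = len(members) / len(points)  # share of all points in cluster j
        r_j = np.sqrt(np.mean(np.sum((members - centers[j]) ** 2, axis=1)))
        war += p_j * r_j
    return war
```

Evaluating this function for a range of K values and looking for the point where the curve flattens corresponds to the selection rule described above.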

3) Dimensionality reduction

To improve the efficiency of K-means in handling high-dimensional samples, singular value decomposition (SVD) can be used when the number of input variables is large. Let X, consisting of the samples of input random variables, be an \( N \times n \) matrix. Applying SVD to X gives:

$$ \varvec{X} = \varvec{U}_{x}\varvec{\varSigma}_{x} \varvec{V}_{x}^{\text{T}} $$
(9)

where \( \varvec{\varSigma}_{x} \) is a diagonal matrix with the singular values along the main diagonal; Ux and Vx are the left and right singular matrices obtained by performing SVD on X, respectively.

The high-dimensional samples X can be converted to low-dimensional samples \( \varvec{X}^{'} \) as follows:

$$ \varvec{X}^{'} = \varvec{XV}_{x} \left( {1:r} \right) $$
(10)

where \( \varvec{V}_{x} \left( {1:r} \right) \) is the first r columns of Vx. The value of r is the smallest number of leading singular values whose sum of squares exceeds 90% of the sum of squares of all singular values [34]. It is more efficient to perform K-means on the low-dimensional samples \( \varvec{X}^{'} \) than on the high-dimensional samples X.
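Equations (9) and (10), together with the 90% rule for choosing r, can be sketched with NumPy as follows. The function name and the `energy` parameter are hypothetical; following (9), the sample matrix is decomposed as given, without centering.

```python
import numpy as np


def reduce_dimension(X, energy=0.90):
    """Project the N x n sample matrix X onto its first r right singular vectors.

    r is the smallest number of leading singular values whose sum of squares
    exceeds the fraction `energy` of the sum of squares of all singular values.
    """
    X = np.asarray(X, dtype=float)
    # Economy-size SVD: X = U @ diag(s) @ Vt, singular values in descending order.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.argmax(ratio > energy)) + 1  # first index where the ratio exceeds the threshold
    # Eq. (10): X' = X V_x(1:r); the rows of Vt are the columns of V_x.
    X_low = X @ Vt[:r].T
    return X_low, r
```

K-means is then run on the N × r matrix `X_low`, and the resulting cluster labels apply to the rows of the original X.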

3.1.3 Overall procedure of improved K-means algorithm

Following the methods described in the above subsections, the overall procedure of the improved K-means algorithm is shown in Fig. 2, where Nmax is the threshold on the number of input variables above which dimensionality reduction is performed.

Fig. 2 Overall procedure of improved K-means algorithm

3.2 Computation of final cumulants

After the K-means clustering, a number of clusters are identified. In each cluster, the cumulant method is utilized to solve PLF. Once the computation for all clusters is completed, the law of total probability is applied to combine the moments obtained in all clusters to obtain the final cumulants of output random variables for the whole samples.

Assuming y to be one of output random variables, its final cumulants can be calculated as follows:

Step 1: The cumulants obtained using CM in each cluster can be converted to the corresponding moments.

$$ \mu_{r}^{i} = \left\{ {\begin{array}{*{20}l} {k_{1}^{i} } \hfill & \quad{r = 1} \hfill \\ {k_{r}^{i} + \sum\limits_{j = 1}^{r - 1} {C_{r - 1}^{j} \mu_{j}^{i} k_{r - j}^{i} } } \hfill & \quad{r >1} \hfill \\ \end{array} } \right. $$
(11)

where \( \mu_{r}^{i} \) is the rth moment of y for the ith cluster; kir is the rth cumulant of y for the ith cluster; \( C_{r - 1}^{j} \) is the binomial coefficient, which is equal to the number of subsets of j distinct elements of r−1 elements.

Step 2: The final moments for the whole samples can be calculated according to the law of total probability.

$$ \mu_{r}^{y} = \sum\limits_{i = 1}^{K} {p_{i} } \mu_{r}^{i} $$
(12)

where \( \mu_{r}^{y} \) is the rth moment for the whole samples.

Step 3: The final cumulants for the original whole samples can be calculated as (13).

$$ k_{r}^{y} = \left\{ {\begin{array}{*{20}l} {\mu_{1}^{y} } \quad & {r = 1} \hfill\\ {\mu_{r}^{y} - \sum\limits_{j = 1}^{r - 1} {C_{r - 1}^{j} \mu_{j}^{y} k_{r - j}^{y} } } \quad & {r > 1} \hfill \\ \end{array} } \right. $$
(13)

where \( k_{r}^{y} \) is the rth cumulant of the output variable y for the whole samples.
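Steps 1 to 3 can be sketched in plain Python as follows, with (11), (12) and (13) mapped to one function each. The function names and the two-cluster example are hypothetical.

```python
from math import comb


def cumulants_to_moments(k):
    """Eq. (11): raw moments from cumulants; k[r - 1] is the r-th cumulant."""
    mu = []
    for r in range(1, len(k) + 1):
        mu.append(k[r - 1] + sum(comb(r - 1, j) * mu[j - 1] * k[r - j - 1]
                                 for j in range(1, r)))
    return mu


def moments_to_cumulants(mu):
    """Eq. (13): cumulants from raw moments; mu[r - 1] is the r-th moment."""
    k = []
    for r in range(1, len(mu) + 1):
        k.append(mu[r - 1] - sum(comb(r - 1, j) * mu[j - 1] * k[r - j - 1]
                                 for j in range(1, r)))
    return k


def combine_clusters(probs, cluster_cumulants):
    """Final cumulants of y from per-cluster cumulants and cluster probabilities."""
    order = len(cluster_cumulants[0])
    moments = [cumulants_to_moments(k) for k in cluster_cumulants]
    # Eq. (12): law of total probability applied to each raw moment.
    total = [sum(p * m[r] for p, m in zip(probs, moments)) for r in range(order)]
    return moments_to_cumulants(total)


# Hypothetical check: two zero-variance clusters at y = 0 and y = 2, each with
# probability 0.5, i.e. a symmetric two-point distribution.
final = combine_clusters([0.5, 0.5], [[0, 0, 0, 0], [2, 0, 0, 0]])
```

For this two-point distribution the result is a mean of 1, a variance of 1, a zero third cumulant and a fourth cumulant of −2, which matches the known cumulants of a symmetric two-point variable. Note that the moments, not the cumulants, are averaged across clusters: cumulants of a mixture are not the weighted sum of the component cumulants.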

3.3 Procedure of solving PLF using proposed method

Figure 3 shows the flow chart of the proposed method, where k is the current cluster. A five-step procedure is described as follows.

Fig. 3 Flow chart of proposed method

Step 1: Apply the inverse Nataf transformation to generate wind speed and load samples. The wind power samples can be obtained according to (3).

Step 2: Apply the improved K-means to cluster the wind power and load samples into a number of clusters.

Step 3: In each cluster, use the CM to solve PLF considering correlations of wind power outputs and loads. First, transform the correlated samples into uncorrelated samples using the Cholesky decomposition. Then, calculate the cumulants of the uncorrelated samples [25]. Finally, execute the CM introduced in Section 2 to calculate the cumulants of all output random variables.

Step 4: Calculate the final cumulants of output random variables using the method introduced in Section 3.2.

Step 5: Approximate the PDFs of output random variables using Gram-Charlier series expansion due to its good tail behavior [22].
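As an illustration of Step 5, the sketch below evaluates a Gram-Charlier type A approximation of a PDF from the first four cumulants. It is deliberately truncated at the fourth-order term for brevity, whereas the case studies in this paper use a 7th-order expansion; the function name is hypothetical.

```python
import math


def gram_charlier_pdf(x, k1, k2, k3=0.0, k4=0.0):
    """Gram-Charlier type A approximation of a PDF at x.

    k1..k4 are the first four cumulants (mean, variance, third and fourth
    cumulant). Truncated after the fourth-order term (an assumption made
    to keep the sketch short).
    """
    sigma = math.sqrt(k2)
    z = (x - k1) / sigma  # standardized variable
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    he3 = z ** 3 - 3.0 * z  # probabilists' Hermite polynomial He3
    he4 = z ** 4 - 6.0 * z ** 2 + 3.0  # probabilists' Hermite polynomial He4
    correction = (1.0
                  + k3 / (6.0 * sigma ** 3) * he3
                  + k4 / (24.0 * sigma ** 4) * he4)
    return phi / sigma * correction
```

With zero third and fourth cumulants the correction factor is 1 and the function reduces to the normal PDF with mean k1 and variance k2. A truncated expansion of this kind can become negative in the tails when the higher-order cumulants are large.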

This paper solves PLF problems for a determined network and does not consider equipment contingencies such as N−1 contingency. If equipment contingencies are required, the proposed method can be performed for each contingency. Then, the results can be combined according to the law of total probability.

4 Case study

The proposed method, namely the improved K-means based cumulant method (IKCM), is tested on modified IEEE 9-bus and 118-bus test systems [35], which are integrated with additional wind farms. Table 1 lists the particulars of the wind farms. In addition, vci = 3 m/s, vr = 13 m/s, and vco = 25 m/s [33]. The wind farms are assumed to be PQ buses, whose power factors are kept constant at 0.85 lagging [13]. In both cases, MCS with 20000 samples is applied to solve PLF, and its results are treated as the benchmark to assess the accuracy and efficiency of the proposed method. In addition, the uncorrelated CM (UCM), the correlated CM (CCM), the 2m+1 PEM and LHS-based MCS are conducted for comparison. The UCM does not consider correlations of input random variables. The CCM handles correlated input random variables using the Cholesky decomposition. The 2m+1 PEM is the one proposed in [15]. Reference [6] proposed an LHS-based PLF method and showed that it could obtain accurate results with hundreds of simulations. In this paper, the LHS-based MCS is conducted with 500 samples. The errors of the cumulants and PDFs obtained using IKCM, UCM, CCM, 2m+1 PEM and LHS are measured by the absolute percent error (APE) and the average root mean square (ARMS) error, as shown in (14) and (15), respectively. The programs are developed in MATLAB and executed on a PC with a 2.6 GHz Intel (R) Core (TM) i5 duo processor and 8 GB DDR3 RAM.

$$ APE = \left| {\frac{r_{\text{o}} - r_{\text{MCS}} }{r_{\text{MCS}} }} \right| \times 100\% $$
(14)
$$ ARMS = \frac{1}{{N_{\text{p}} }}\sqrt {\sum\limits_{i = 1}^{{N_{\text{p}} }} {\left( {OM_{i} - MCS_{i} } \right)^{2} } } $$
(15)

where ro is the cumulant value obtained using different methods except MCS; rMCS is the cumulant value obtained using MCS; OMi denotes the value of the ith point on the PDFs obtained using different methods except MCS; MCSi denotes the value of the ith point on the PDFs obtained using MCS; Np is the number of points on PDFs.
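The two indices translate directly into code. The sketch below follows (14) and (15) as written; the function names are hypothetical, and the PDFs are assumed to be sampled at the same Np points.

```python
import math


def ape(r_o, r_mcs):
    """Absolute percent error (14) of a cumulant r_o against the MCS value r_mcs."""
    return abs((r_o - r_mcs) / r_mcs) * 100.0


def arms(om, mcs):
    """ARMS error (15) between two PDFs sampled at the same N_p points."""
    n_p = len(mcs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(om, mcs))) / n_p
```

Because APE divides by the MCS value, it can be very large when that value is close to zero even if the absolute error is negligible.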

Table 1 Particulars of wind farms

4.1 Case 1: modified IEEE 9-bus test system

4.1.1 Basic information

In the modified IEEE 9-bus test system, all loads have constant power factors. The active load demand at each bus is modeled as a Gaussian distribution, whose mean is provided in MATPOWER [35] and whose standard deviation is equal to 10% of its mean. Weibull distributions are used to model wind speeds. Table 2 lists the shape and scale parameters of wind speeds [36]. The correlation coefficient between loads is assumed to be 0.8, the correlation coefficient between wind speeds is assumed to be 0.76, and the correlation coefficient between the wind speed and load at the same bus is assumed to be 0.2 [33]. The PDFs of active load power and wind power at bus 7 are depicted by histograms as shown in Figs. 4 and 5.

Table 2 Parameters of wind speeds (case 1)

Fig. 4 PDF of active load power at bus 7

Fig. 5 PDF of wind power at bus 7

4.1.2 Performance of improved K-means clustering

According to (8), the relationship between WAR and the number of clusters can be obtained, as shown in Fig. 6.

Fig. 6 Relationship between WAR and number of clusters

It can be observed that the WAR declines slowly after the number of clusters is more than 40. This implies that the clustering performance will not significantly improve when the number of clusters is above 40. Therefore, the K value is suggested to be 40.

Table 3 shows the clustering results of the K-means algorithm. After clustering, the input random variable samples are grouped into 40 clusters. The variance reflects the fluctuation of a random variable. For each cluster, the variance of each random variable is calculated, so that 40 variance values corresponding to the 40 clusters are obtained. Among these 40 values, we choose the minimum, the maximum and the mean value to represent the fluctuation level of each input random variable within the clusters. The chosen values are labelled as Smin, Smax and Smean, respectively. The column labelled S is the variance of a specific input random variable for all original samples. It can be observed that the variances after the improved K-means clustering are much smaller than those for all original samples.

Table 3 Comparison of variances of input variables

4.1.3 Probabilistic results

Table 4 lists the results of different methods used to solve PLF problems for this test system. The results are aggregated into: VA which stands for voltage angles, VM which stands for voltage magnitudes, PL which stands for line active power flows, and QL which stands for line reactive power flows, since it is difficult to present all output variables individually. The columns labelled “εr1”, “εr2”, “εr3” and “εr4” are APE values of the first four cumulants compared with those obtained using MCS. The mean and maximum values of the APE values are shown to demonstrate the scope of APE values for a class of variables.

Table 4 Comparison of first four cumulants (case 1)

It can be observed from Table 4 that all APE values obtained using IKCM are small, which indicates that the cumulants obtained using IKCM are approximately the same as those of MCS. The worst APE value of the proposed method is 59.29%, which occurs at the εr4 of VM at bus 5. However, the actual error for this VM is only \( -3.66 \times 10^{-11} \) p.u. Because the reference value itself is so small, this variable may mislead the comparison on APE and should not be used to assess the performance of different methods. Compared with IKCM, the 2m+1 PEM has smaller values of εr1, but has much larger values of εr3 and εr4. The CCM has much larger values of εr1, εr2, εr3 and εr4 than IKCM. Two points about the CCM are worth noting. First, the values of εr3 and εr4 are much larger than the values of εr1 and εr2. Second, the results for reactive quantities (VM and QL) are worse than those for active quantities (VA and PL), which is most evident in the values of εr3. The UCM has large values of εr2, εr3 and εr4. Compared with IKCM, LHS produces slightly larger errors. However, LHS has much smaller εr1 and εr2 than CCM and UCM, and has significantly smaller εr3 and εr4 than 2m+1 PEM. The second, third and fourth cumulants reflect the variance, skewness and kurtosis of an output random variable, respectively. Therefore, large values of εr2, εr3 and εr4 can distort the PDF curves. The PDF curves of output variables (VA at PV and PQ buses, VM at PQ buses, PL and QL) are approximated using the 7th-order Gram-Charlier series expansion.

Figures 7 and 8 show the PDFs of the VM at bus 9 and the active power flow in line 7–8, respectively. From the comparison in Figs. 7 and 8, the PDFs of IKCM can better match MCS histograms than CCM, UCM and 2m+1 PEM. The PDFs of LHS are close to those of IKCM.

Fig. 7 PDFs of VM at bus 9

Fig. 8 PDFs of PL in line 7-8

Figure 9 shows the ARMS results of the PDF curves of all output random variables. The box plots corresponding to IKCM are all below those corresponding to CCM, UCM, 2m+1 PEM and LHS, which indicates that the PDFs produced by IKCM approximate those of MCS better. The comparison between LHS and IKCM shows that the ARMS values of the PDFs of LHS are also small, and that the PDFs of LHS are close to those of IKCM. It should be pointed out that in Fig. 9d there are two outliers in the box plots of IKCM. However, their actual ARMS values are only 0.0995% and 0.1170%. It can be seen that the CCM performs worse on reactive quantities than on active quantities. This characteristic can also be observed from the results of cumulants and PDFs. The reason is that reactive quantities generally have a higher degree of non-linearity than active quantities. Moreover, the computation time consumed by each method and the number of deterministic power flow (DPF) calculations conducted by each method are shown in Table 5. It can be seen that the proposed method spends much less time than LHS and MCS.

Fig. 9 Comparisons of ARMS of PDF curves for case 1

Table 5 Comparison of computation time (case 1)

An additional experiment is conducted to examine the performance of the proposed method with more clusters, where the proposed method with 60 clusters is implemented on this test system. The results indicate that the proposed method is more accurate with 60 clusters than with 40 clusters. For example, the εr1, εr2, εr3 and εr4 of QL in line 1–4 obtained using 40 clusters are 1.45%, 0.41%, 8.64% and 18.18%, respectively, whereas those obtained using 60 clusters are 1.22%, 0.30%, 5.96% and 15.97%, respectively. More clusters thus produce more accurate results, but also require more computation time.

It can be concluded that the proposed method has higher computational accuracy than CCM, UCM, 2m+1 PEM and LHS, and is more efficient than LHS and MCS. In addition, more clusters can achieve higher accuracy at the expense of efficiency.

4.2 Case 2: modified IEEE 118-bus test system

The modified IEEE 118-bus test system is used to examine the feasibility of the proposed method for a large system with multiple wind farms. Weibull distributions are used to model wind speeds. Table 6 lists the shape and scale parameters of wind speed distributions [36]. The correlations of wind speeds at buses 17 and 30, buses 59 and 80, and buses 92 and 100 are set to be 0.88, and others are set to be 0.48. All loads have constant power factors. The active load demand at each bus is modeled as a Gaussian distribution, whose mean is provided in MATPOWER [35] and whose standard deviation is equal to 10% of its mean.

Table 6 Parameters of wind speeds (case 2)

The relationship between WAR and the number of clusters for case 2 can be obtained using (8). The number of clusters is suggested to be 40. In this test system, there are 105 input random variables, including 6 wind power outputs and 99 load demands. Therefore, the dimensionality reduction based on SVD is applied in the K-means process, where the first six singular values are selected and their quadratic sum is equal to 92.13% of the quadratic sum of all singular values. The computation times of the traditional K-means and the improved K-means with SVD are 3.08 s and 0.92 s, respectively. It can be seen that the K-means algorithm achieves an efficiency improvement through the dimensionality reduction based on SVD.

Table 7 presents the results of cumulants. Figures 10 and 11 show the PDFs of the PL in line 100–101 and the QL in line 79–80. It is of note that the proposed method has somewhat large εr3 and εr4 values for a few system variables. The reason is that the Cholesky decomposition algorithm used to handle correlations introduces some errors in high-order cumulants when input random variables are not normally distributed. The final PDFs of these output random variables obtained using the proposed method still have low ARMS values, as shown in Fig. 12. The comparison between Table 8 and Table 5 shows that the 2m+1 PEM does not have an obvious advantage in computational efficiency over the proposed method when solving PLF problems for this large test system. This is because the 2m+1 PEM conducts 211 load flow simulations for this test system due to its 105 input random variables. The results for this IEEE 118-bus test system lead to the same conclusion: the proposed method has higher computational accuracy than CCM, UCM, 2m+1 PEM and LHS, and spends much less time than LHS and MCS.

Table 7 Comparison of first four cumulants (case 2)

Fig. 10 PDFs of PL in line 100-101

Fig. 11 PDFs of QL in line 79-80

Fig. 12 Comparisons of ARMS of PDF curves for case 2

Table 8 Comparison of computation time (case 2)

4.3 Discussion about stability of proposed method

The proposed method is based on the clustering algorithm. Theoretically, the result of clustering is the local optimal solution, which is influenced by the initial cluster centers. In order to examine the stability of the proposed method, probabilistic power flow for the modified IEEE 118-bus test system is conducted 100 times with random initial cluster centers. The APE values of the first four cumulants of each type of variables obtained in each simulation are summed and averaged. In Table 9, the columns labelled with εr1,mean, εr2,mean, εr3,mean and εr4,mean are the mean values of APE values of the first four cumulants for 100 simulations. It can be seen that the errors in Table 9 are approximately equal to the corresponding values of the proposed method in Table 7, which demonstrates that the proposed method can achieve stable and accurate results.

Table 9 Errors of first four cumulants for 100 simulations

5 Conclusion

A novel PLF method considering large-scale wind power integration is proposed in this paper. In the proposed method, an improved K-means algorithm is used to cluster the samples of input random variables, and the law of total probability is applied to combine the results obtained in each cluster. From the case studies on modified IEEE 9-bus and 118-bus test systems, some conclusions are drawn as follows:

1) To solve PLF considering large-scale wind power integration, the proposed method can achieve higher accuracy than traditional CM, 2m+1 PEM and LHS, and higher efficiency than LHS and MCS. In other words, the proposed method can achieve a better performance with consideration of both computational efficiency and accuracy.

2) More clusters will produce more accurate results at the expense of time. The suggested number of clusters should be determined in advance.

3) The traditional CM considering the correlation of input random variables generally has significant errors for reactive quantities.

4) The 2m+1 PEM has accurate results for the first two cumulants but not for the third and fourth cumulants, which results in significant errors in PDFs of output variables.

In conclusion, as the proposed method has been tested on the small and large test systems, it can provide an accurate and efficient tool for power system planning and operation with large-scale wind power.