Jump in dH at restarting of scf-cycle from density for large systems

Bug #1834949 reported by Peter Christian Schmitz on 2019-07-01
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Siesta
Status tracked in Trunk
4.1
High
Unassigned
Trunk
High
Unassigned

Bug Description

Dear Developers and Siesta community,

i have problems to converge the scf cycle of a calculation with 650 atoms due to an imprecision in the restart procedure:

After i saved the density matrix in the previous run with small dH error, which however was not yet sufficient for convergence, the restarted scf cycle always has a much higher dH error, by 1-2 orders of magnitude, from which the code has to converge down again.

Since the cluster on which i am calculating limits my calculation time to 24 hours, the error therefore accumulates despite convergence steps in each cycle, when the number of cycles is not sufficient to overcome 2 orders of magnitude in dH.

The error gets smaller when one decreases the mixing coefficient, but it is now at 0.01 and i doubt that further decrease would be beneficial to convergence. A mixing of 0.00 does NOT reproduce the state exactly after reloading.

It seems to be impossible to restart the calculation with the wavefunctions from the previous run directly. It only uses the density.
=> Could it be that the new initial wavefuncitons are too atomic / randomized / not good enough for such a large system such that i get this behavior ?
=> How can i improve the accuracy of the first restarted scf cycle ?

i am using Siesta version r764.
The problem occurs both in SOC and non-polarized calculations, but is not as strong in the latter than with SOC, leading to convergence there.

Thank you for your help,
Peter Schmitz

Related branches

summary: - imprecision in dH at restart of scf from density in r764 for large
- systems
+ Jump in dH at restarting of scf-cycle from density for large systems
Nick Papior (nickpapior) wrote :

Yes, this is a well-known problem that we have tried to pin down (for years now), but we cannot seem to really pin-point it... :(

I would suggest you decrease initial linear mixing to something like 0.0001 and then once pulay mixing enters it should be more stable.

If you can re-create a system with a fast SCF (i.e. very few atoms) it might be easier for us to dig this out? What is the smallest system you see this behaviour for?

Nick Papior (nickpapior) wrote :

Could you try this:

In the source code Src/m_new_dm.F90

Please comment out these lines (@ line 340).

I.e. change so it looks like this:

! else if ( mix_scf_first ) then
! mix_scf_first = .false.

now recompile and try again.

It would be really helpful if you could provide an output for two calculations starting from the same DM:
1) without the above change
2) with the above change

Please note that this would only be helpful if you ensure that they start from exactly the same DM file. Thanks!

Dear Mr. Papior,

thank you for your help. First, here are the 2 files for the problematic calculation. They contain:
- a NSP calculation, then a SOC calculation with Pulay mixing, and then the restart of the SOC calculation with linear mixing and 0.00 weight
- a NSP calculation, then a SOC calculation with Pulay mixing, and then the restart of the SOC calculation with linear mixing and 0.00 weight, using the modified version of the code in the last step.

I am performing the NSP calculation first to start at a smaller dH in the SOC calculations. I found that this is very beneficial for such large system. It saves alot of time, but the error also appears when i directly run a SOC calculation (see below). The used expert routine just allows to limit the number of diagonalized bands, speeding up the calculation significantly. The error also appears if this band limitation is not used (see again below).
In the second SOC calculation, i am always using the basis from the last run (*.ion.nc files) by the User.Basis flags, as the safest way to get the correct basis used in the last calculation. It has the same behavior though when i re-provide the basis manually.

The dH jump occurs in every system that is restarted, but scales with the mixing, the system size and spin degree of freedom.
Attached is also a system with less atoms, (6 units slab), first directly calculated with SOC and converged successfully, then restarted and interrupted at an earlier stage.
Also, i noticed for this system that, in an SOC calculation from scratch, when i use linear mixing, history=0 and coefficient=0, the dH does actually diverge. This should not happen..

Generally, as can be seen from the above files, the code does not reproduce the last step with dH=0 even if the mixing is set to 0, with linear mixing and 0 DM history. Would you recommend to save more DM history ? (but this should be ruled out by linear mixing compared to pulay)
Is there any possibility to reload the Hamiltonian directly, not the DM ?
I have set the SCF.Mix.First false

Maybe it could also be useful to question why there is always a printout in the output file saying
"redata: New DM Mixing Weight = 0.25"
independent on the mixing set in the input file ? There is a second line where the correct mixing is printed, but this line always appears.

Kind regards,
Peter Schmitz

Nick Papior (nickpapior) wrote :

Dear Peter,

1) I can't see in your input that you actually specify the initial linear mixing to 0? In fact, in all of your provided outputs you are actually requesting the following:

# When you specify SCF.Mixer.Weight, you are "overwriting" DM.MixingWeight
SCF.Mixer.Weight 0.3
SCF.Mix.First false

This means that you are doing an initial Linear mixing of 1! The mix first = false, means that there will be no mixing and this is why you see the big jump. (no mixing means output == new input).

If you want a smaller initial linear mixing you should do:
%block SCF.Mixers
  Pulay
%endblock SCF.Mixers

%block SCF.Mixer.Pulay
  # Mixing method
  method pulay
  variant stable

  # Mixing options
  weight 0.3000
  weight.linear 0.0001 # this determines the initial linear mixing if SCF.Mix.First is true
  history 8
%endblock SCF.Mixer.Pulay

2) As for using the expert routines, this is good! Perhaps you may be interested in doing something like:
# A negative number *adds* the given number to the number of eigenstates calculated, so it is easier
# to *control* since it is an offset independent on from the apparent number of states.
NumberOfEigenstates -10
Also, you should probably compile with elpa as that should give you some 10% faster diagonalization (I would assume). NumberOfEigenstates should also work with the ELPA-driver.
One thing to note. In your largest calculation I am wondering whether you would actually get a slightly better performance by decreasing your # of cores. You have roughly 6 orbitals per core, which means their individual workload is very (extremely!) low. My suspicion is that you could try and halve the number of cores at a very low cost in time (but huge saving in CPUTIME). This of course requires you to test this.

3) My patch should only have an effect when you have SCF.Mix.First true, so that is why you don't see any difference.

4) You write "when i use linear mixing, history=0 and coefficient=0, the dH does actually diverge. This should not happen..", I don't know what you mean with coefficient=0? Is it mixing weight?
Could you provide the input for this? My suspicion is that you are still having a finite mixing weight.

5) The history is not needed in linear, as you correctly note. So it does not matter. The main problem here is that in all your (shown) outputs your first step is using a mixing weight of 1 which effectively is a *hard* kick if you are restarting from a non-converged solution (as in your case).
I would suggest you try SCF.Mix.First true and run without my code modification, and with my code modification. However, I have heard from others that *perhaps* the SCF.Mix.First true is not the best for SOC. This is still somewhat a mystery to me.

6) I will patch 4.1 and later to disable the print-out of DM mixing weight.

Dear Mr. Papior,

thank you, this clarifies alot !!

.. It is quite counterintutive that a flag with name
"SCF.Mix.First"
does enable to actually *not* mix the first step if it is set to true together with a small weight.linear. Ok, i will try it now with the modifications.

Indeed, with "coefficient = 0" i meant the mixing weight. I have set it to 0.00 in the 24-unit calculations, thus expected no change. I was not aware that this is separate from the initial mixing. The 0.3 was not used in SOC for the large systems, only for the 6-unit calculation.

Regarding the "this should not happen" : Now it is clear based on your explanation that the code does change the initial state and therefore probably diverges. I dont have these files anymore, unfortunately, but it should be clear after the tests.

I have looked into ELPA and had written to our IT, because its installation seems to require some nontrivial steps for the compilation. But it is very nice that the code comes with these capabilities.. also MRRR (which did not give a significant speedup however, but maybe due to other reasons).
The speedup from 40*24 to 80*24 cores, within expert mode, was nearly linear, meaning it achieved twice as many cycles. After that, the scaling is bad though (the IT suggested to use even more cores but this quickly showed to be inefficient for this system, with limited number of orbitals).

If the restart works, this number can be reduced. I just tried to get enough cycles in 1 day to somehow get over the jump..

Thank you again !
Hopefully the data will support this

Nick Papior (nickpapior) wrote :

1) No, SCF.Mix.First true DOES enable mixing.
Mix refers to mixing in the previous iterations DM/H with the output of the current iteration. If false, it *only* uses the current iteration output (no mixing), and if true it also uses the previous input.

2) Ok, good you have tested the scaling!

Nick Papior (nickpapior) wrote :

Hope it works out!

Download full text (4.1 KiB)

Dear Mr. Papior,

it is still a bit strange:
In both calculations with the mixing modifications, but one with the modified code, dH still jumps

--- Without Code Modification ----
   scf: 8 -1804868.778352 -1804868.732982 -1804868.733022 0.001146 -2.812871 0.312873
   scf: 9 -1804868.749710 -1804868.735597 -1804868.735641 0.000361 -2.813709 0.234864
Geom step, scf iteration, dmax: 0 10 0.017262
redata: Recompute H after scf cycle = F
siesta: DUscf = 413.344772
        iscf Eharris(eV) E_KS(eV) FreeEng(eV) dDmax Ef(eV) dHmax(eV)
   scf: 1 -1804421.557777 -1804719.910852 -1804719.911018 0.050963 -2.853464 62.730089
   scf: 2689464237.365204228548699.015671228548699.015350 51.191287-18.050496**********
   scf: 3 -1807335.880661 -1804834.918607 -1804834.918839 51.191233 -2.792301 28.033179
   scf: 4 -1804850.731953 -1804845.865303 -1804845.865456 0.010372 -2.813543 24.742072

--- With Code Modification ----
   scf: 8 -1804868.778352 -1804868.732982 -1804868.733022 0.001146 -2.812871 0.312873
   scf: 9 -1804868.749710 -1804868.735597 -1804868.735641 0.000361 -2.813709 0.234864
Geom step, scf iteration, dmax: 0 10 0.017262
redata: Recompute H after scf cycle = F
siesta: DUscf = 413.344772
        iscf Eharris(eV) E_KS(eV) FreeEng(eV) dDmax Ef(eV) dHmax(eV)
   scf: 1 -1804421.557777 -1804719.910852 -1804719.911018 0.050963 -2.853464 62.730089
   scf: 2 -1804742.932702 -1804731.529616 -1804731.529750 0.002985 -2.849860 60.125657
   scf: 3 -1804889.097349 -1804864.557415 -1804864.557459 0.043795 -2.827967 11.121388
   scf: 4 -1804867.632473 -1804866.689589 -1804866.689632 0.003100 -2.820711 8.032126

interestingly, the 2nd Jump is less severe and thereafter it converges quickly. So that modification is clearly useful. But the problem is still there.
Concerning the mixing, Siesta prints:

ps356562: cat GST.out | grep mixing
mix.SCF: Pulay mixing = Pulay
mix.SCF: Linear mixing weight = 0.020000
mix.SCF: Pulay mixing = Pulay
mix.SCF: Linear mixing weight = 0.020000
mix.SCF: Spin-component mixing all
mix.SCF: Pulay mixing = Pulay
mix.SCF: Linear mixing weight = 0.000100
mix.SCF: Spin-component mixing all

the last step is the restarted (soc from soc) one. Apparently it applies the linear mixing correctly.

BUT:
I also tried with permanent Linear mixing, and setting both the initial mixing weight and the later mixing weight to zero.
However, this did not circumvent the initial dH jump (although, with the code modification, the 2nd dDmax jump to 51.* would probably not have happened)

        iscf Eharris(eV) E_KS(eV) FreeEng(eV) dDmax Ef(eV) dHmax(eV)
   scf: 1 -1804421.557777 -1804719.910852 -1804719.911018 0.050963 -2.853464 62.730089
   scf: 2689464237.365204228548699.015671228548699.015350 51.191287-18.050496**********
   scf: 3228548699.015671228548699.015671228548699.015...

Read more...

Nick Papior (nickpapior) wrote :

1) I will create a patch and request it to be handled in siesta

2) Your last output: NoCodeModification_CompletelyLinear_0Mixing__GST.out is very wrong. A mixing weight of 0 means that you *never* change the DM/H. Hence you will always do the same calculation over and over. :(
However, I am not surprised in this case either since without the code-modification there is an implicit linear mixing of 1. for the first iteration (regardless of your fdf-input) which explains the drastic jump.

All-in-all, the code modification works and will probably be the *new* default somehow.

Thanks for your analysis.

Dear Mr. Papior,

to 1):
Thank you for the patch and trying to solve this issue in future versions of Siesta.
However, as written above, even with the code modification a significant dH jump occurs after restarting. Even when the initial linear mixing weight is set to almost 0. The issue does not yet seem to be clarified..

to 2):
Thank you for the concern, but actually i had set the mixing weight to 0 to demonstrate the following:
- 1. as you noted, the initial dH jump is still present. Probably the second dH value would be equivalent to the first if i had used the modified code. But the higher dH value remains. This should not happen according to the documentation, because no new density would be mixed in at all, at the first step. As you now said: Without the modification the initial mixing is 1. Ok. So this can explain why the second jump is still there. But why the first ?
- 2. the dH value also stays constant after 3 steps. This should not happen either, because, if the density is not changed, the dH error between the steps should go down to 0 ! (or am i misunderstanding this ?)

From my side, the following questions remain:
- can the initial dH jump be reduced to 0 (compared to the provided density from the last run) or the dH value from the last run ?
- can the Hamiltonian be loaded directly ?

Nick Papior (nickpapior) wrote :

1) Yes, the dH jump is sort of artificial. The dH you see are calculated as this:
DM_0 = read from *.DM file
H_0 is calculated from DM_0
DM_1 is calculated from H_0
H_1 is calculated from DM_1
dH = H_1 - H_0

Subsequent iterations does this:
H = mixed from previous iterations
DM_i is calculated from H
H_i is calculated from DM_i

This is why you *always* see a finite dHmax value, regardless of mixing == 0.

2.1) Yes, the second dH value would be equivalent to the first for the code change. Note that you are mixing the Hamiltonian in your setup. If you instead mixed the DM you would see the opposite effect, dDmax is finite, but dHmax is 0.
Also, the first dHmax is explained by the above.

2.2) This is because you are mixing H, in that case you will always have the same dHmax for mixing = 0 since you start each SCF with the same H, and you always compare with the output H.

a) initial dH will only be zero in the first step IFF you mix DM and you read in the Hamiltonian that corresponds to the DM file (this can't be done currently).
b) This is an undocumented feature, "Read.H.From.File true" (flag will probably change). But the above should explain why you see the things you find. They are not strange :)

1) ok, thank you very much!
Then i had misunderstood dH as the change upon mixing, but actually it is the change upon constructing the Hamiltonian twice from a density (which is actually the DFT accuracy, correct).
Then the jump in dH at the restart must refer to an anomaly in the way how the hamiltonian is contructed / the next density DM1 is written (which must be different from what happens at later cycles / resp. earlier in the previous calculation).
Can we somehow find the bug by comparing these paths through the code in the first and in the restarted calculation, with weight.linear =0 ? Are these pathways different in the code ? (eg. regarding how history is considered, which orbital basis is used, or the previous single point calculation.. ?)

Is it correct that dH is always calculated by twice constructing a hamiltonian from the last-mixed density, or will dH, after i=1, be calculated from Hamiltonians that correspond to the original and to the next mixed DM ?
Then i could understand why dH is different in the restarted calculation compared to the previous one. But still, after the first step, it should go back to the value from the previous cycle (if mixing is 0 and it did not make any false guesses based on this dH value)..

2.*) Ok, understood, that makes sense.

a) indeed one could, just for the 1st step, set the mixing to DM and read in the .DM and .H files ? I will try if it is possible
b) thank you very much ! so i will use this feature and see if providing only the Hamiltonian, not the DM, gives a better restart dH

Nick Papior (nickpapior) wrote :

1) Yes, but only in the first SCF step. In subsequent SCF steps the Hamiltonian is the mixed Hamiltonian. The initial dH jump is, if you like, an anomaly. But I wouldn't consider it a bug. The problem is that when you do regular SCF steps you do a regular Pulay mixing. While when you restart you are *forced* to do an initial linear mixing. In this regard the linear mixing is really crude compared to the Pulay mixing. I would suspect that if you store the complete mixing history the initial step wouldn't be any different than a continued SCF calculation (this still needs to be implemented).

After the first step you cannot ensure that dH *goes back* since your initial mixing is linear vs. Pulay. However, in the case where you *only* use linear mixing I would expect all SCF dH values to be equal (except the 1st). I.e. you can compare in this latter case.

2.a) you can't change the mixing from DM to H during the SCF (if this was what you meant?)

1) Ok, this explains why, in a "normal" Pulay calculation the initial dH would jump upon restarting even when one set
Write.DM.History.NetCDF true
because the linear mixing gives different results than Pulay (in the history provided for the 2nd restarted step).

Still: Why does the the initial dH jump even occur when the initial linear mixing weight is set to 0 ? Mixing should have no impact on this value.
Yes, dH should stay constant (at the initial, "jumped" / high value) if weight.linear and mixing weight is set to 0. I am repeating this with the modified code now.

Regarding the implementation: It seems the mixing history can already be saved (flag above). A Pulay mixing routine at the restart would indeed be really nice to have (it might turn the "inevitable" initial dH error back to the value of the previous cycle)
But as long as it is not implemented :

=> Would it be beneficial to close the previous calculation with a linear mixing step (maybe even with zero mixing) ? One could then add a linear mixing block at the beginning of the restarted calculation, too, eliminating the linear / pulay mixing change during the restart. It would be shifted to later, inside the scf run

2.a) Ok. Yes it is also not a property of the mixer blocks. What remains is that one could try a permanent DM-mixing for the restarted calculation, with H provided from the previous run. Or a permanent H mixing, with H provided from the previous run (2.b). Testing this.

Nick Papior (nickpapior) wrote :

1) The initial dH jump occurs because it is sort of a linear mixing of 1 since it reads the DM, creates H, then calculates the new DM and the new H. So I would consider this a feature when mixing H. When you re-read the H it should be somewhat different.

Yes, you can save the SCF DM's, but you can't read it in yet... :(

=> I think not. I think it is better to have a *very* small linear mixing for the initial iscf=1 step, then you can bump up the weight in subsequent SCF iterations.

2) Ok.

Nick Papior (nickpapior) wrote :

Hi,

We have fixed this in later 4.1 and trunk versions.

Could you read the manual and see if it makes sense, i.e. lookup the keyword: SCF.Mix.First and SCF.Mix.First.Force for details.

They should behave as *my* fix, however without any changes to the code.

I'll mark this as resolved for now. We know that we need a restart capability from H, but we consider this another incremental work and not a bug.

Thanks!

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Related questions